The basic unit of document enrichment in MAT is the annotation. There are two
types of annotations in MAT: span
annotations and spanless
annotations. Span annotations are anchored to a
particular contiguous span in the document, and make some implicit
assertion about it; e.g., the span from character 10 to character
15 is a noun phrase. Spanless annotations are not anchored to a
span, and are used to make assertions about the entire document,
or to make assertions about other annotations; e.g., annotation 1
and annotation 2 refer to the same entity.
The task specification for each task contains one or more annotation set descriptors, which
define the annotations available in the task. Simple tasks will
typically have only one human-annotatable annotation set
descriptor, but MAT 3.0 supports complex tasks where you can
define more than one such descriptor. For details, see the section
on annotation sets and
categories below.
In MAT, annotations have labels, and may have attributes and
values associated with them. Each attribute has a type and
an aggregation.
The types are:
See here
for examples of defining attributes of various types.
It is possible for any attribute to have a null value (Java and JavaScript null, Python None), or not to be present. These conditions are intended to be equivalent, and the various document libraries mostly treat them that way; however, there may be times in Python where an attribute which has never been set will raise a KeyError.
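The null-vs-absent distinction can be pictured with a plain Python dict standing in for an annotation's attribute table (an illustrative stand-in, not MAT's actual document classes):

```python
# A dict standing in for an annotation's attributes. "type" is present
# but explicitly null; "comment" has never been set.
attrs = {"type": None}

# Defensive access treats the two conditions as equivalent:
print(attrs.get("type"))     # None (explicitly null)
print(attrs.get("comment"))  # None (never set)

# Naive subscripting distinguishes them -- the source of the KeyError
# mentioned above:
try:
    attrs["comment"]
except KeyError:
    print("KeyError: 'comment' was never set")
```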
The aggregations are:
MAT does not currently limit the possible combinations of these
types and aggregations, even if some of them are nonsensical
(e.g., a set of booleans). However, the annotation UI cannot yet edit
all 15 possible combinations; some of the less common ones (e.g., sets
of strings which are not restricted by a choice list) have not been
implemented. We hope to support all of them eventually.
MAT also supports the notion of effective label. An
effective label is a notional label (e.g., "PERSON") which is
presented to the user in its notional form, but is implemented as
an annotation label plus an attribute value of a string attribute
which is specified to have an exhaustive list of choices. Labels
of this sort are also known as "ENAMEX-style" labels, after the
MUC convention of defining "PERSON" as "ENAMEX" + type="PER", and
"ORGANIZATION" and "LOCATION" analogously.
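The mapping can be sketched as a small lookup table. Note that only PERSON = ENAMEX + type="PER" comes from the MUC convention described above; the attribute values for ORGANIZATION and LOCATION below are assumed analogues, not taken from an actual task file:

```python
# Illustrative ENAMEX-style effective-label table:
# effective label -> (true annotation label, attribute name, attribute value)
EFFECTIVE_LABELS = {
    "PERSON":       ("ENAMEX", "type", "PER"),
    "ORGANIZATION": ("ENAMEX", "type", "ORG"),  # assumed value
    "LOCATION":     ("ENAMEX", "type", "LOC"),  # assumed value
}

def expand(effective_label):
    """Map the label shown to the user to its underlying label + attribute."""
    true_label, attr, value = EFFECTIVE_LABELS[effective_label]
    return true_label, {attr: value}

print(expand("PERSON"))  # ('ENAMEX', {'type': 'PER'})
```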
Simple span annotations are span annotations which have
either no attributes, or a single attribute/value pair which
defines the effective label. This notion is important because it
is the class of annotations which the jCarafe CRF engine can build
models for. For a discussion of how MAT's capabilities are limited
when only the jCarafe engine is available, see here.
Annotation types in MAT are aggregated at two levels: sets, in
the form of annotation set
descriptor declarations, and categories, which are
aggregations of the set descriptors.
Your task will always contain multiple categories of annotations.
At the very least, it will contain the administrative annotations
which all tasks must have. Your task will probably also inherit
annotation categories for tokenization and zoning from the
(implicit) root task. But the annotation sets and categories
you'll be most interested in are the ones you yourself define,
which are the annotations you'll be adding by hand, or using the jCarafe engine to add
automatically. We call these annotations "content annotations",
and we'll describe them first, and then talk about the other
categories that MAT makes available.
The semantics of the content annotations are defined by the task.
For instance, the sample "Named Entity" task is about annotation
of named entities, and the three content annotation labels used
there have a standardized interpretation in named entity
annotation, for which detailed guidelines have been developed
about when to assign each of these annotation labels.
Good guidelines are crucial for maximizing agreement among human
annotators when preparing gold-standard corpora. When you develop
your own task, you should develop similar guidelines; this skill
is quite sophisticated, and is outside the scope of this
documentation.
By default, content annotation sets are managed, which
means that the annotation progress for these sets is tracked using
the SEGMENT
administrative annotation. While you can disable set
management, we don't recommend it; most of MAT is tested only with
managed content annotation sets.
In most tasks, you'll only have one annotation set descriptor
describing your content annotations. But you might have good
reasons to have more than one. For instance, in a complex relation
annotation task, you might have multiple training engines, one for
spans and one for relations, or you might want to subdivide your
hand annotation effort into multiple steps, where all spans must
be added (and perhaps reconciled or reviewed) before any relations
are added. Another scenario might involve wanting to separate
marking span annotations from marking attributes on those
annotations. In these cases, you can define multiple annotation
set descriptors. You can reference these sets when you create
the steps in your task.
When you have multiple annotation set descriptors, you can
aggregate them into categories, or assign each of them to its own
category. You might want to aggregate them if you want to
reference them, e.g., when you call MATReport
or MATScore.
The root task provides the "zone" annotation category and set. Zones correspond to large chunks of the document where annotations can be found. The default zone annotation in MAT is the "zone" label, which has an attribute called "region_type", whose value is typically "body" (it's possible for the jCarafe engine to use the zone attributes as a training feature, but we don't use that capability at the moment). The implementation of your task provides the jCarafe engine with information about the default zone annotation; you can change this implementation (and you must if you have a custom zone annotation) as described in the advanced topics.
The simplest zoning of a document is to assign a single zone of
region_type "body" which encompasses the whole document, and there
is a zone step available which does this. If you want to get more
sophisticated (e.g., exclude HTML or XML tags), you'll have to
consult the advanced
topics.
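The whole-document zoning described above amounts to a single annotation. The dict layout below is an assumption for illustration, not MAT's document API:

```python
# Sketch of the simplest zoning: one "zone" annotation of region_type
# "body" spanning the entire document.
def trivial_zone(doc_text):
    return {"label": "zone",
            "start": 0,
            "end": len(doc_text),
            "attrs": {"region_type": "body"}}

print(trivial_zone("Some document text."))
```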
Zone tags are also used in the UI to induce the creation of
untaggable regions, which are regions that have no zone
annotation. These regions are indicated visually by graying out
the document text. If the document has any zone annotations at
all, the UI will create these regions.
The root task provides the "token" annotation category and set.
Tokens correspond, pretty much, to words. In MAT, token
annotations are used as the basis for most computation. When you
hand-annotate your documents, you are encouraged to require that
token annotations are present, so that the MAT annotation tool can
determine the possible boundaries of your proposed annotation.
This is because the jCarafe trainer and tagger both use tokens
(not characters) as the "atoms" of computation, and as a result,
any annotation whose boundaries do not coincide with token
boundaries must be modified or discarded during the training phase
(because the element at the relevant edge doesn't correspond to an
"atom" as far as jCarafe is concerned). Similarly, you must use
the same automatic tokenization process for training and tagging,
for obvious reasons.
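The boundary problem can be sketched concretely: if token spans are the atoms, an annotation edge that falls inside a token must be widened to the enclosing token edges (or the annotation discarded). This is purely illustrative; it is not MAT's actual training-time behavior:

```python
# Token spans as (start, end) character offsets.
tokens = [(0, 3), (4, 9), (10, 15)]

def snap_to_tokens(start, end):
    """Widen an annotation span to the nearest enclosing token boundaries."""
    starts = [s for s, _ in tokens]
    ends = [e for _, e in tokens]
    if start in starts and end in ends:
        return start, end  # already token-aligned
    new_start = max([s for s in starts if s <= start], default=start)
    new_end = min([e for e in ends if e >= end], default=end)
    return new_start, new_end

print(snap_to_tokens(4, 9))   # (4, 9)  -- aligned, unchanged
print(snap_to_tokens(6, 12))  # (4, 15) -- widened to token edges
```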
jCarafe is one of many trainable annotation tools which rely on
tokens as atoms; however, most hand annotation tools don't try to
ensure in advance that annotation boundaries match token
boundaries, and the users of such tools have to make
accommodations later in their workflows.
The MAT toolkit comes with a default English tokenizer, which we
describe when we talk about steps.
There's nothing special about this tokenizer; you can replace it
with your own, as long as you use the same tokenizer throughout
your task. If you inherit your token annotations from the core
task, and use the default tokenizer, you don't have to think about
this any further. If you don't, your tokenizer and task have to
make sure of several things:
As an example, let's take a look at the annotations in the sample 'Named Entity' task.
| Annotation label | Set | Category | Use |
|---|---|---|---|
| lex | token | token | each lex tag delimits a non-whitespace token (the basic element used in the jCarafe trainer and tagger) |
| zone | zone | zone | delimits a contiguous zone in which content annotations will be found |
| PERSON | content | content | a proper name of a person |
| LOCATION | content | content | a proper name of a location |
| ORGANIZATION | content | content | a proper name of an organization |
| SEGMENT | admin | admin | administrative info about the progress of annotation (present in all tasks) |
The Sample Relations
task provides an example of multiple content annotation sets:
| Annotation label | Set | Category | Use |
|---|---|---|---|
| lex | token | token | each lex tag delimits a non-whitespace token (the basic element used in the jCarafe trainer and tagger) |
| zone | zone | zone | delimits a contiguous zone in which content annotations will be found |
| PERSON | entities | content | a proper name of a person |
| LOCATION | entities | content | a proper name of a location |
| ORGANIZATION | entities | content | a proper name of an organization |
| NATIONALITY | nationality | content | a descriptive adjective for a country (e.g., "Brazilian") |
| Employment | relations | content | a relationship between employer and employee |
| Located | relations | content | a relationship between a located entity and a location |
| SEGMENT | admin | admin | administrative info about the progress of annotation (present in all tasks) |
The final category of annotations, admin annotations, is crucial to the operation of
MAT. There is currently only one admin annotation, SEGMENT. If you
don't care about the inner bookkeeping that MAT uses, you may skip
this section; all you really need to know is that using the
SEGMENT admin annotation, MAT can keep detailed track of how the
annotation of various portions of a document is progressing.
When content annotation types are managed (which is the default
for all non-admin annotation types), it means that the annotation
progress for these annotation types is tracked by SEGMENTs.
These admin annotations play a much smaller role in MAT 3.0
than we originally anticipated. We had originally expected MAT
3.0 to contain support for partial hand annotation and
correction (e.g., annotating only certain regions of documents).
However, we have not been able to build out this partial
annotation capability, especially as it might interact with
spanless annotations. So in MAT 3.0, SEGMENT annotations cover
exactly the same spans as zone annotations, and we don't use
them to mark partial annotation (although many of the hooks that
might enable that capability in the future remain).
The SEGMENT is a span annotation. It has three attributes, "set",
"annotator" and "status". The value of the "set" attribute is the
name of the annotation set for which the progress is relevant. The
other two attributes occur in the following configurations of
values:
| Value of the annotator attribute | Value of the status attribute | What it means |
|---|---|---|
| (null) | "non-gold" | The segment is untouched for the given set. No human or automated annotator has modified it. |
| "MACHINE" | "non-gold" | The segment has been annotated for the given set by an automated tool, but no human has marked it as completed. |
| (a user's name) | "non-gold" | A user "owns" this segment and has modified the annotations for the given set in some way, but is not prepared to mark it as completed. If a user corrects a set which an automated tool has annotated, the user now "owns" the segment. |
| (a user's name) | "human gold" | A user "owns" the segment and has marked it as complete. |
| (a user's name) | "reconciled" | A user "owns" the segment and has marked it complete, and the segment has been vetted by the reconciliation process. |
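A tool interpreting these configurations might look like the sketch below (the function name and return strings are illustrative, not MAT's API):

```python
def describe_segment(annotator, status):
    """Interpret a SEGMENT's annotator/status pair for one set."""
    if status == "reconciled":
        return "complete and vetted by reconciliation"
    if status == "human gold":
        return "marked complete by %s" % annotator
    # status == "non-gold"
    if annotator is None:
        return "untouched"
    if annotator == "MACHINE":
        return "machine-annotated, not yet corrected"
    return "in progress, owned by %s" % annotator

print(describe_segment(None, "non-gold"))       # untouched
print(describe_segment("alice", "human gold"))  # marked complete by alice
```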
It's possible, then, to have multiple segments, corresponding to
different annotation sets, some of whose segments are gold and
some of whose segments are not gold. (If partial annotation were
enabled, this would extend to different regions of the document
for the same annotation set as well.)
When these set statuses are combined with the steps which add or modify those sets, we can derive the step statuses which we use in the workspaces.
| Step status | Documents it applies to |
|---|---|
| "reconciled" | documents all of whose SEGMENTs for the given step's managed sets have status = "reconciled" |
| "gold" | documents all of whose SEGMENTs for the given step's managed sets have status = "reconciled" or status = "human gold" (and at least one "human gold" SEGMENT) |
| "partially gold" | documents some (but not all) of whose SEGMENTs for the given step's managed sets have status = "reconciled" or status = "human gold". This status will not be encountered, since MAT 3.0 does not make use of partial annotation. |
| "uncorrected" | documents all of whose SEGMENTs for the given step's managed sets have annotator = "MACHINE" |
| "partially corrected" | documents some of whose SEGMENTs for the given step's managed sets have annotator != "MACHINE" and annotator != null (i.e., they've been touched by a human annotator) |
| "unannotated" | documents which have no content annotations for any of the given step's managed sets and no SEGMENTs for the given step's managed sets which are owned by any annotator |
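The derivation can be sketched as a function over a document's SEGMENTs for a step's managed sets. This is illustrative only: names are assumptions, the segment list is assumed non-empty, and the "no content annotations" check for "unannotated" is omitted for brevity:

```python
def document_status(segments):
    """Derive a step status from (annotator, status) pairs of SEGMENTs."""
    if all(s == "reconciled" for _, s in segments):
        return "reconciled"
    gold = [(a, s) for a, s in segments if s in ("reconciled", "human gold")]
    if len(gold) == len(segments):
        return "gold"  # all gold, at least one "human gold"
    if gold:
        return "partially gold"  # unused in MAT 3.0 (no partial annotation)
    if all(a == "MACHINE" for a, _ in segments):
        return "uncorrected"
    if any(a not in (None, "MACHINE") for a, _ in segments):
        return "partially corrected"
    return "unannotated"  # content-annotation check omitted here

print(document_status([("MACHINE", "non-gold")]))  # uncorrected
```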
All the relevant MAT tools are aware of the SEGMENT annotations.