Annotations and annotation progress

Annotations

The basic unit of document enrichment in MAT is the annotation. There are two types of annotations in MAT: span annotations and spanless annotations. Span annotations are anchored to a particular contiguous span in the document, and make some implicit assertion about it; e.g., the span from character 10 to character 15 is a noun phrase. Spanless annotations are not anchored to a span, and are used to make assertions about the entire document, or to make assertions about other annotations; e.g., annotation 1 and annotation 2 refer to the same entity.
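Schematically, the two types differ in whether they carry character offsets. Here is a minimal sketch using plain Python dicts (not MAT's actual internal representation):

    # Minimal sketches using plain dicts, not MAT's actual representation.
    # A span annotation is anchored to character offsets in the document:
    span_annotation = {"label": "NP", "start": 10, "end": 15}
    # A spanless annotation has no offsets; here it points at other
    # annotations to assert that they corefer (an invented label):
    spanless_annotation = {"label": "Coreference",
                           "mentions": [span_annotation]}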

The task specification for each task contains one or more annotation set descriptors, which define the annotations available in the task. Simple tasks will typically have only one human-annotatable annotation set descriptor, but MAT 3.0 supports complex tasks where you can define more than one such descriptor. For details, see the section on annotation sets and categories below.

Annotation attributes and values

In MAT, annotations have labels, and may have attributes and values associated with them. Each attribute has a type and an aggregation.

The types are: string, int, float, boolean, and annotation.

See here for examples of defining attributes of various types.

It is possible for any attribute to have a null value (Java and JavaScript null, Python None), or not to be present. These conditions are intended to be equivalent, and the various document libraries mostly treat them that way; however, there may be times in Python where an attribute which has never been set will raise a KeyError.
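The distinction can matter in practice. Here is a minimal sketch of the Python behavior, using a plain dict rather than the actual MAT document API:

    # A plain dict, not the actual MAT document API. An attribute set to
    # None and an attribute that was never set mostly behave alike, but
    # direct indexing of a never-set attribute raises KeyError.
    attrs = {"type": None}          # attribute present, value is null

    print(attrs["type"])            # -> None
    print(attrs.get("subtype"))     # -> None (never set; no error here)
    print(attrs["subtype"])         # raises KeyError (never set)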

The aggregations are: none (i.e., a single value), set, and list.

MAT does not currently limit the possible combinations of these types and aggregations, even if some of them are nonsensical (e.g., a set of booleans). However, the annotation UI has not been enriched with the ability to edit all 15 possible combinations; some of the less common ones (e.g., sets of strings which are not restricted by a choice list) have not been implemented yet. We hope to support all of them, eventually.

MAT also supports the notion of effective label. An effective label is a notional label (e.g., "PERSON") which is presented to the user in its notional form, but is implemented as an annotation label plus an attribute value of a string attribute which is specified to have an exhaustive list of choices. Labels of this sort are also known as "ENAMEX-style" labels, after the MUC convention of defining "PERSON" as "ENAMEX" + type="PER", and "ORGANIZATION" and "LOCATION" analogously.
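For example, in MUC-style inline XML, a mention of a person would be tagged <ENAMEX type="PER">John Smith</ENAMEX> (the name here is invented for illustration); an effective label lets the UI present this annotation to the annotator simply as PERSON.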

Simple span annotations vs. complex annotations

Simple span annotations are span annotations which have either no attributes, or a single attribute/value pair which defines the effective label. This notion is important because it is the class of annotations which the jCarafe CRF engine can build models for. For a discussion of how MAT's capabilities are limited when only the jCarafe engine is available, see here.

Annotation sets and categories

Annotation types in MAT are aggregated at two levels: sets, in the form of annotation set descriptor declarations, and categories, which are aggregations of the set descriptors.

Your task will always contain multiple categories of annotations. At the very least, it will contain the administrative annotations which all tasks must have. Your task will probably also inherit annotation categories for tokenization and zoning from the (implicit) root task. But the annotation sets and categories you'll be most interested in are the ones you define yourself, which are the annotations you'll be adding by hand, or using the jCarafe engine to add automatically. We call these annotations "content annotations", and we'll describe them first, and then talk about the other categories that MAT makes available.

Content annotations

The semantics of the content annotations are defined by the task. For instance, the sample "Named Entity" task is about annotation of named entities, and the three content annotation labels used there have a standardized interpretation in named entity annotation; detailed guidelines have been developed about when to assign each of these labels.

Good guidelines are crucial for maximizing agreement among human annotators when preparing gold-standard corpora. When you develop your own task, you should develop similar guidelines; doing so is a sophisticated skill, and one which is outside the scope of this documentation.

By default, content annotation sets are managed, which means that the annotation progress for these sets is tracked using the SEGMENT administrative annotation. While you can disable set management, we don't recommend it; most of MAT is tested only with managed content annotation sets.

In most tasks, you'll only have one annotation set descriptor describing your content annotations. But you might have good reasons to have more than one. For instance, in a complex relation annotation task, you might have multiple training engines, one for spans and one for relations; or you might want to subdivide your hand annotation effort into multiple steps, where all spans must be added (and perhaps reconciled or reviewed) before any relations are added. Another scenario might involve separating the marking of span annotations from the marking of attributes on those annotations. In these cases, you can define multiple annotation set descriptors, and reference these sets when you create the steps in your task.

When you have multiple annotation set descriptors, you can aggregate them into categories, or assign each of them to its own category. Aggregating them is useful when you want to refer to them as a group, e.g., when you call MATReport or MATScore.

Zone annotations

The root task provides the "zone" annotation category and set. Zones correspond to large chunks of the document where annotations can be found. The default zone annotation in MAT is the "zone" label, which has an attribute called "region_type", whose value is typically "body" (it's possible for the jCarafe engine to use the zone attributes as a training feature, but we don't use that capability at the moment). The implementation of your task provides the jCarafe engine with information about the default zone annotation; you can change this implementation (and you must if you have a custom zone annotation) as described in the advanced topics.

The simplest zoning of a document is to assign a single zone of region_type "body" which encompasses the whole document, and there is a zone step available which does this. If you want to get more sophisticated (e.g., exclude HTML or XML tags), you'll have to consult the advanced topics.
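As a concrete sketch, a whole-document zone for a 1,200-character document might carry these values; the dict form below is illustrative, not MAT's exact JSON encoding:

    # A simplified, illustrative sketch, not MAT's exact JSON encoding.
    # One zone annotation spanning the entire (hypothetical) document:
    zone_annotation = {
        "label": "zone",
        "start": 0,                        # first character of the signal
        "end": 1200,                       # hypothetical document length
        "attrs": {"region_type": "body"},  # the default region type
    }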

Zone tags are also used in the UI to induce the creation of untaggable regions, which are regions that have no zone annotation. These regions are indicated visually by graying out the document text. If the document has any zone annotations at all, the UI will create these regions.

Token annotations

The root task provides the "token" annotation category and set. Tokens correspond, pretty much, to words. In MAT, token annotations are used as the basis for most computation. When you hand-annotate your documents, you are encouraged to require that token annotations are present, so that the MAT annotation tool can determine the possible boundaries of your proposed annotation. This is because the jCarafe trainer and tagger both use tokens (not characters) as the "atoms" of computation, and as a result, any annotation whose boundaries do not coincide with token boundaries must be modified or discarded during the training phase (because the element at the relevant edge doesn't correspond to an "atom" as far as jCarafe is concerned). Similarly, you must use the same automatic tokenization process for training and tagging, for obvious reasons.

jCarafe is one of many trainable annotation tools which rely on tokens as atoms; however, most hand annotation tools don't try to ensure in advance that annotation boundaries match token boundaries, and the users of such tools have to make accommodations later in their workflows.
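One common accommodation is to snap annotation spans outward to the token boundaries that contain them. Here is a hedged sketch of such a helper (hypothetical code, not part of MAT or jCarafe):

    # Hypothetical helper, not part of MAT or jCarafe: expand a span to
    # the smallest token-aligned span that contains it. Tokens are
    # (start, end) character offsets, sorted and non-overlapping.
    def snap_to_tokens(start, end, tokens):
        overlapping = [(s, e) for (s, e) in tokens if s < end and e > start]
        if not overlapping:
            return None  # no overlapping token: the annotation is discarded
        return (overlapping[0][0], overlapping[-1][1])

    tokens = [(0, 3), (4, 9), (10, 15)]       # "The quick brown"
    print(snap_to_tokens(5, 12, tokens))      # -> (4, 15)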

The MAT toolkit comes with a default English tokenizer, which we describe when we talk about steps. There's nothing special about this tokenizer; you can replace it with your own, as long as you use the same tokenizer throughout your task. If you inherit your token annotations from the core task, and use the default tokenizer, you don't have to think about this any further. If you don't, your tokenizer and task have to make sure of several things:

A sample inventory of annotations

As an example, let's take a look at the annotations in the sample "Named Entity" task.

Annotation label | Set | Category | Use
lex | token | token | each lex tag delimits a non-whitespace token (the basic element used in the jCarafe trainer and tagger)
zone | zone | zone | delimits a contiguous zone in which content annotations will be found
PERSON | content | content | a proper name of a person
LOCATION | content | content | a proper name of a location
ORGANIZATION | content | content | a proper name of an organization
SEGMENT | admin | admin | administrative info about the progress of annotation (present in all tasks)

The Sample Relations task provides an example of multiple content annotation sets:

Annotation label | Set | Category | Use
lex | token | token | each lex tag delimits a non-whitespace token (the basic element used in the jCarafe trainer and tagger)
zone | zone | zone | delimits a contiguous zone in which content annotations will be found
PERSON | entities | content | a proper name of a person
LOCATION | entities | content | a proper name of a location
ORGANIZATION | entities | content | a proper name of an organization
NATIONALITY | nationality | content | a descriptive adjective for a country (e.g., "Brazilian")
Employment | relations | content | a relationship between employer and employee
Located | relations | content | a relationship between a located entity and a location
SEGMENT | admin | admin | administrative info about the progress of annotation (present in all tasks)

Admin annotations, SEGMENTs, and annotation progress

The final category of annotations, admin annotations, is crucial to the operation of MAT. There is currently only one admin annotation, SEGMENT. If you don't care about MAT's internal bookkeeping, you may skip this section; all you really need to know is that, using the SEGMENT admin annotation, MAT can keep detailed track of how the annotation of various portions of a document is progressing.

When content annotation types are managed (the default for all non-admin annotation types), the annotation progress for those types is tracked by SEGMENTs.

These admin annotations play a much smaller role in MAT 3.0 than we originally anticipated. We had originally expected MAT 3.0 to contain support for partial hand annotation and correction (e.g., annotating only certain regions of documents). However, we have not been able to build out this partial annotation capability, especially as it might interact with spanless annotations. So in MAT 3.0, SEGMENT annotations cover exactly the same spans as zone annotations, and we don't use them to mark partial annotation (although many of the hooks that might enable that capability in the future remain).

The details

SEGMENTs cover the same spans as zone annotations. (If partial annotation were enabled, they would be a disjoint cover of the zone annotations.) These annotations capture a number of aspects of the progress of annotation.

The SEGMENT is a span annotation. It has three attributes, "set", "annotator" and "status". The value of the "set" attribute is the name of the annotation set for which the progress is relevant. The other two attributes occur in the following configurations of values:

Value of the annotator attribute | Value of the status attribute | What it means
(null) | "non-gold" | The segment is untouched for the given set. No human or automated annotator has modified it.
"MACHINE" | "non-gold" | The segment has been annotated for the given set by an automated tool, but no human has marked it as completed.
(a user's name) | "non-gold" | A user "owns" this segment and has modified the annotations for the given set in some way, but is not prepared to mark it as completed. If a user corrects a set which an automated tool has annotated, the user now "owns" the segment.
(a user's name) | "human gold" | A user "owns" the segment and has marked it as complete.
(a user's name) | "reconciled" | A user "owns" the segment and has marked it complete, and the segment has been vetted by the reconciliation process.

It's possible, then, for a document to have multiple segments, corresponding to different annotation sets, some of which are gold and some of which are not. (If partial annotation were enabled, this would extend to different regions of the document for the same annotation set as well.)

When these set statuses are combined with the steps which add or modify those sets, we can derive the step statuses which we use in the workspaces (a sketch of this derivation follows the table below):

"reconciled"
documents all of whose SEGMENTs for the given step's managed sets have status = "reconciled"
"gold"
documents all of whose SEGMENTs for the given step's managed sets have status = "reconciled" or status = "human gold" (and at least one "human gold" SEGMENT)
"partially gold"
documents some (but not all) of whose SEGMENTs for the given step's managed sets have status = "reconciled" or status = "human gold". This status will not be encountered, since MAT 3.0 does not make use of partial annotation.
"uncorrected"
documents all of whose SEGMENTs for the given step's managed sets have annotator = "MACHINE"
"partially corrected"
documents some of whose SEGMENTs for the given step's managed sets have annotator != "MACHINE" and annotator != null (i.e., they've been touched by a human annotator)
"unannotated"
documents which have no content annotations for any of the given step's managed sets and no SEGMENTs for the given step's managed sets which are owned by any annotator
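Here is a hedged sketch of that derivation (a hypothetical helper, not MAT's actual implementation; the "unannotated" case is simplified, since it ignores the check for content annotations):

    # Hypothetical helper, not MAT's actual implementation. Each segment
    # is a (set_name, annotator, status) triple; the "unannotated" case
    # is simplified (it does not check for content annotations).
    def step_status(segments, managed_sets):
        segs = [s for s in segments if s[0] in managed_sets]
        annotators = {a for (_, a, _) in segs}
        statuses = {st for (_, _, st) in segs}
        if segs and statuses == {"reconciled"}:
            return "reconciled"
        if "human gold" in statuses and statuses <= {"human gold", "reconciled"}:
            return "gold"
        if statuses & {"human gold", "reconciled"}:
            return "partially gold"  # not encountered in MAT 3.0
        if any(a not in (None, "MACHINE") for a in annotators):
            return "partially corrected"
        if segs and annotators == {"MACHINE"}:
            return "uncorrected"
        return "unannotated"

    # e.g., a single machine-annotated segment:
    print(step_status([("content", "MACHINE", "non-gold")], {"content"}))
    # -> "uncorrected"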

All the relevant MAT tools are aware of the SEGMENT annotations.