The basic unit of document enrichment in MAT is the annotation. There are two
types of annotations in MAT: span
annotations and spanless
annotations. Span annotations are anchored to a
particular contiguous span in the document, and make some implicit
assertion about it; e.g., the span from character 10 to character
15 is a noun phrase. Spanless annotations are not anchored to a
span, and are used to make assertions about the entire document,
or to make assertions about other annotations; e.g., annotation 1
and annotation 2 refer to the same entity.
The task specification for each task contains one or more annotation set descriptors, which
define the annotations available in the task. Simple tasks will
typically have only one human-annotatable annotation set
descriptor, but MAT 3.0 supports complex tasks where you can
define more than one such descriptor. For details, see the section
on annotation sets and
categories below.
In MAT, annotations have labels, and may have attributes and
values associated with them. Each attribute has a type and
an aggregation.
The types are:
See here
for examples of defining attributes of various types.
It is possible for any attribute to have a null value (Java and JavaScript null, Python None), or not to be present. These conditions are intended to be equivalent, and the various document libraries mostly treat them that way; however, there may be times in Python where an attribute which has never been set will raise a KeyError.
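The null-vs-absent distinction can be pictured with a plain Python dict standing in for an annotation's attribute table (an illustrative stand-in, not MAT's actual document classes):

```python
# A dict standing in for an annotation's attributes. "type" is present
# but explicitly null; "comment" has never been set.
attrs = {"type": None}

# Defensive access treats the two conditions as equivalent:
print(attrs.get("type"))     # None (explicitly null)
print(attrs.get("comment"))  # None (never set)

# Naive subscripting distinguishes them -- the source of the KeyError
# mentioned above:
try:
    attrs["comment"]
except KeyError:
    print("KeyError: 'comment' was never set")
```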
The aggregations are:
MAT does not currently limit the possible combinations of these
types and aggregations, even if some of them are nonsensical
(e.g., a set of booleans). However, the annotation UI cannot yet edit
all 15 possible combinations; some of the less common ones (e.g., sets
of strings which are not restricted by a choice list) have not been
implemented. We hope to support all of them eventually.
MAT also supports the notion of effective label. An
effective label is a notional label (e.g., "PERSON") which is
presented to the user in its notional form, but is implemented as
an annotation label plus an attribute value of a string attribute
which is specified to have an exhaustive list of choices. Labels
of this sort are also known as "ENAMEX-style" labels, after the
MUC convention of defining "PERSON" as "ENAMEX" + type="PER", and
"ORGANIZATION" and "LOCATION" analogously.
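The mapping can be sketched as a small lookup table. Note that only PERSON = ENAMEX + type="PER" comes from the MUC convention described above; the attribute values for ORGANIZATION and LOCATION below are assumed analogues, not taken from an actual task file:

```python
# Illustrative ENAMEX-style effective-label table:
# effective label -> (true annotation label, attribute name, attribute value)
EFFECTIVE_LABELS = {
    "PERSON":       ("ENAMEX", "type", "PER"),
    "ORGANIZATION": ("ENAMEX", "type", "ORG"),  # assumed value
    "LOCATION":     ("ENAMEX", "type", "LOC"),  # assumed value
}

def expand(effective_label):
    """Map the label shown to the user to its underlying label + attribute."""
    true_label, attr, value = EFFECTIVE_LABELS[effective_label]
    return true_label, {attr: value}

print(expand("PERSON"))  # ('ENAMEX', {'type': 'PER'})
```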
Simple span annotations are span annotations which have
either no attributes, or a single attribute/value pair which
defines the effective label. This notion is important because it
is the class of annotations which the jCarafe CRF engine can build
models for. For a discussion of how MAT's capabilities are limited
when only the jCarafe engine is available, see here.
Annotation types in MAT are aggregated at two levels: sets, in
the form of annotation set
descriptor declarations, and categories, which are
aggregations of the set descriptors.
Your task will always contain multiple categories of annotations.
At the very least, it will contain the administrative annotations
which all tasks must have. Your task will probably also inherit
annotation categories for tokenization and zoning from the
(implicit) root task. But the annotation sets and categories
you'll be most interested in are the ones you yourself define,
which are the annotations you'll be adding by hand, or using the jCarafe engine to add
automatically. We call these annotations "content annotations",
and we'll describe them first, and then talk about the other
categories that MAT makes available.
The semantics of the content annotations are defined by the task.
For instance, the sample "Named Entity" task is about annotation
of named entities, and the three content annotation labels used
there have a standardized interpretation in named entity
annotation, for which detailed guidelines have been developed
about when to assign each of these annotation labels.
Good guidelines are crucial for maximizing agreement among human
annotators when preparing gold-standard corpora. When you develop
your own task, you should develop similar guidelines; this skill
is quite sophisticated, and is outside the scope of this
documentation.
By default, content annotation sets are managed, which
means that the annotation progress for these sets is tracked using
the SEGMENT
administrative annotation. While you can disable set
management, we don't recommend it; most of MAT is tested only with
managed content annotation sets.
In most tasks, you'll only have one annotation set descriptor
describing your content annotations. But you might have good
reasons to have more than one. For instance, in a complex relation
annotation task, you might have multiple training engines, one for
spans and one for relations, or you might want to subdivide your
hand annotation effort into multiple steps, where all spans must
be added (and perhaps reconciled or reviewed) before any relations
are added. Another scenario might involve wanting to separate
marking span annotations from marking attributes on those
annotations. In these cases, you can define multiple annotation
set descriptors. You can reference these sets when you create
the steps in your task.
When you have multiple annotation set descriptors, you can
aggregate them into categories, or assign each of them to its own
category. You might want to aggregate them if you want to
reference them, e.g., when you call MATReport
or MATScore.
The root task provides the "zone" annotation category and set. Zones correspond to large chunks of the document where annotations can be found. The default zone annotation in MAT is the "zone" label, which has an attribute called "region_type", whose value is typically "body" (it's possible for the jCarafe engine to use the zone attributes as a training feature, but we don't use that capability at the moment). The implementation of your task provides the jCarafe engine with information about the default zone annotation; you can change this implementation (and you must if you have a custom zone annotation) as described in the advanced topics.
The simplest zoning of a document is to assign a single zone of
region_type "body" which encompasses the whole document, and there
is a zone step available which does this. If you want to get more
sophisticated (e.g., exclude HTML or XML tags), you'll have to
consult the advanced
topics.
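The whole-document zoning described above amounts to a single annotation. The dict layout below is an assumption for illustration, not MAT's document API:

```python
# Sketch of the simplest zoning: one "zone" annotation of region_type
# "body" spanning the entire document.
def trivial_zone(doc_text):
    return {"label": "zone",
            "start": 0,
            "end": len(doc_text),
            "attrs": {"region_type": "body"}}

print(trivial_zone("Some document text."))
```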
Zone tags are also used in the UI to induce the creation of
untaggable regions, which are regions that have no zone
annotation. These regions are indicated visually by graying out
the document text. If the document has any zone annotations at
all, the UI will create these regions.
The root task provides the "token" annotation category and set.
Tokens correspond, pretty much, to words. In MAT, token
annotations are used as the basis for most computation. When you
hand-annotate your documents, you are encouraged to require that
token annotations are present, so that the MAT annotation tool can
determine the possible boundaries of your proposed annotation.
This is because the jCarafe trainer and tagger both use tokens
(not characters) as the "atoms" of computation, and as a result,
any annotation whose boundaries do not coincide with token
boundaries must be modified or discarded during the training phase
(because the element at the relevant edge doesn't correspond to an
"atom" as far as jCarafe is concerned). Similarly, you must use
the same automatic tokenization process for training and tagging,
for obvious reasons.
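The boundary problem can be sketched concretely: if token spans are the atoms, an annotation edge that falls inside a token must be widened to the enclosing token edges (or the annotation discarded). This is purely illustrative; it is not MAT's actual training-time behavior:

```python
# Token spans as (start, end) character offsets.
tokens = [(0, 3), (4, 9), (10, 15)]

def snap_to_tokens(start, end):
    """Widen an annotation span to the nearest enclosing token boundaries."""
    starts = [s for s, _ in tokens]
    ends = [e for _, e in tokens]
    if start in starts and end in ends:
        return start, end  # already token-aligned
    new_start = max([s for s in starts if s <= start], default=start)
    new_end = min([e for e in ends if e >= end], default=end)
    return new_start, new_end

print(snap_to_tokens(4, 9))   # (4, 9)  -- aligned, unchanged
print(snap_to_tokens(6, 12))  # (4, 15) -- widened to token edges
```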
jCarafe is one of many trainable annotation tools which rely on
tokens as atoms; however, most hand annotation tools don't try to
ensure in advance that annotation boundaries match token
boundaries, and the users of such tools have to make
accommodations later in their workflows.
The MAT toolkit comes with a default English tokenizer, which we
describe when we talk about steps.
There's nothing special about this tokenizer; you can replace it
with your own, as long as you use the same tokenizer throughout
your task. If you inherit your token annotations from the core
task, and use the default tokenizer, you don't have to think about
this any further. If you don't, your tokenizer and task have to
make sure of several things:
As an example, let's take a look at the annotations in the sample 'Named Entity' task.
| Annotation label | Set | Category | Use |
|---|---|---|---|
| lex | token | token | each lex tag delimits a non-whitespace token (the basic element used in the jCarafe trainer and tagger) |
| zone | zone | zone | delimits a contiguous zone in which content annotations will be found |
| PERSON | content | content | a proper name of a person |
| LOCATION | content | content | a proper name of a location |
| ORGANIZATION | content | content | a proper name of an organization |
| SEGMENT | admin | admin | administrative info about the progress of annotation (present in all tasks) |
The Sample Relations
task provides an example of multiple content annotation sets:
| Annotation label | Set | Category | Use |
|---|---|---|---|
| lex | token | token | each lex tag delimits a non-whitespace token (the basic element used in the jCarafe trainer and tagger) |
| zone | zone | zone | delimits a contiguous zone in which content annotations will be found |
| PERSON | entities | content | a proper name of a person |
| LOCATION | entities | content | a proper name of a location |
| ORGANIZATION | entities | content | a proper name of an organization |
| NATIONALITY | nationality | content | a descriptive adjective for a country (e.g., "Brazilian") |
| Employment | relations | content | a relationship between employer and employee |
| Located | relations | content | a relationship between a located entity and a location |
| SEGMENT | admin | admin | administrative info about the progress of annotation (present in all tasks) |
The final category of annotations, admin annotations, is crucial to the operation of
MAT. There is currently only one admin annotation, SEGMENT. If you
don't care about the inner bookkeeping that MAT uses, you may skip
this section; all you really need to know is that using the
SEGMENT admin annotation, MAT can keep detailed track of how the
annotation of various portions of a document is progressing.
When content annotation types are managed (which is the default
for all non-admin annotation types), it means that the annotation
progress for these annotation types is tracked by SEGMENTs.
These admin annotations play a much smaller role in MAT 3.0
than we originally anticipated. We had originally expected MAT
3.0 to contain support for partial hand annotation and
correction (e.g., annotating only certain regions of documents).
However, we have not been able to build out this partial
annotation capability, especially as it might interact with
spanless annotations. So in MAT 3.0, SEGMENT annotations cover
exactly the same spans as zone annotations, and we don't use
them to mark partial annotation (although many of the hooks that
might enable that capability in the future remain).
The SEGMENT is a span annotation. It has three attributes, "set",
"annotator" and "status". The value of the "set" attribute is the
name of the annotation set for which the progress is relevant. The
other two attributes occur in the following configurations of
values:
| Value of the annotator attribute | Value of the status attribute | What it means |
|---|---|---|
| (null) | "non-gold" | The segment is untouched for the given set. No human or automated annotator has modified it. |
| "MACHINE" | "non-gold" | The segment has been annotated for the given set by an automated tool, but no human has marked it as completed. |
| (a user's name) | "non-gold" | A user "owns" this segment and has modified the annotations for the given set in some way, but is not prepared to mark it as completed. If a user corrects a set which an automated tool has annotated, the user now "owns" the segment. |
| (a user's name) | "human gold" | A user "owns" the segment and has marked it as complete. |
| (a user's name) | "reconciled" | A user "owns" the segment and has marked it complete, and the segment has been vetted by the reconciliation process. |
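A tool interpreting these configurations might look like the sketch below (the function name and return strings are illustrative, not MAT's API):

```python
def describe_segment(annotator, status):
    """Interpret a SEGMENT's annotator/status pair for one set."""
    if status == "reconciled":
        return "complete and vetted by reconciliation"
    if status == "human gold":
        return "marked complete by %s" % annotator
    # status == "non-gold"
    if annotator is None:
        return "untouched"
    if annotator == "MACHINE":
        return "machine-annotated, not yet corrected"
    return "in progress, owned by %s" % annotator

print(describe_segment(None, "non-gold"))       # untouched
print(describe_segment("alice", "human gold"))  # marked complete by alice
```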
It's possible, then, to have multiple segments, corresponding to
different annotation sets, some of whose segments are gold and
some of whose segments are not gold. (If partial annotation were
enabled, this would extend to different regions of the document
for the same annotation set as well.)
When these set statuses are combined with the steps which add or modify those sets, we can derive the step statuses which we use in the workspaces.
| Step status | Documents it applies to |
|---|---|
| "reconciled" | documents all of whose SEGMENTs for the given step's managed sets have status = "reconciled" |
| "gold" | documents all of whose SEGMENTs for the given step's managed sets have status = "reconciled" or status = "human gold" (and at least one "human gold" SEGMENT) |
| "partially gold" | documents some (but not all) of whose SEGMENTs for the given step's managed sets have status = "reconciled" or status = "human gold". This status will not be encountered, since MAT 3.0 does not make use of partial annotation. |
| "uncorrected" | documents all of whose SEGMENTs for the given step's managed sets have annotator = "MACHINE" |
| "partially corrected" | documents some of whose SEGMENTs for the given step's managed sets have annotator != "MACHINE" and annotator != null (i.e., they've been touched by a human annotator) |
| "unannotated" | documents which have no content annotations for any of the given step's managed sets and no SEGMENTs for the given step's managed sets which are owned by any annotator |
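The derivation can be sketched as a function over a document's SEGMENTs for a step's managed sets. This is illustrative only: names are assumptions, the segment list is assumed non-empty, and the "no content annotations" check for "unannotated" is omitted for brevity:

```python
def document_status(segments):
    """Derive a step status from (annotator, status) pairs of SEGMENTs."""
    if all(s == "reconciled" for _, s in segments):
        return "reconciled"
    gold = [(a, s) for a, s in segments if s in ("reconciled", "human gold")]
    if len(gold) == len(segments):
        return "gold"  # all gold, at least one "human gold"
    if gold:
        return "partially gold"  # unused in MAT 3.0 (no partial annotation)
    if all(a == "MACHINE" for a, _ in segments):
        return "uncorrected"
    if any(a not in (None, "MACHINE") for a, _ in segments):
        return "partially corrected"
    return "unannotated"  # content-annotation check omitted here

print(document_status([("MACHINE", "non-gold")]))  # uncorrected
```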
All the relevant MAT tools are aware of the SEGMENT annotations.