Use cases for the XML format for the task files (see "Creating a New Task") are described in
this document. The reference document is found here.
Although the annotation types are defined in the task.xml file,
we provide examples for them separately.
Here, we focus on the other elements of the task.xml file.
By default, the labels available to the annotator are listed in
the legend by annotation set, and then alphabetized within each
set. Similarly, in the annotation editor popups, the labels are
alphabetized by default. If you want the labels to appear in the
order they were defined in the task.xml file, you can do this:
<web_customization alphabetize_labels="no"/>
If you want to "brand" your annotation task, you can change the
title of the Web page and the "slug" that appears at the leftmost
edge of the MAT desktop toolbar. The custom branding is used only
if all your installed tasks have the same branding information;
otherwise, MAT falls back to its default branding. You can change
the branding like this:
<web_customization>
<short_name>MCAT</short_name>
<long_name>MCAT: My Company's Annotation Tool</long_name>
</web_customization>
You might want to write documentation for your task, and have it
available in the MAT documentation accessible within the MAT UI.
This functionality has been available in MAT for a very long time,
but exposed in a user-exploitable way only in MAT 3.3.
For reasons that will be clear in a moment, it's important to
understand the URL structure of the documentation. The MAT
documentation is available at the path /MAT/doc/html/.
Every task documentation page is available via the path /MAT/doc/html/tasks/<task_dir_name>/<path_within_task_dir>.
So if you place a documentation page within the doc subdirectory,
it'll be at
/MAT/doc/html/tasks/<task_dir_name>/doc/<page>.html.
Keep this in mind.
Your first step is to write your documentation, in HTML, and
place it in your task directory. You can put it anywhere, but the
conventional place to put it is in a subdirectory named "doc". In
order to maintain look-and-feel consistency with the rest of the
documentation, the MAT doc.css file should be included. This
file's path, relative to the server root, is /MAT/doc/css/doc.css,
so, keeping in mind the URL structure above, if your page is
within the doc subdirectory in your task, you'll add the following
CSS link within your <head>:
<link href="../../../../css/doc.css" rel="stylesheet" type="text/css">
In other words, that's four directories up. It's preferable to
define a relative path instead of specifying the absolute path
starting from the server, because if you build a MAT distribution
which includes your task, a static subset of the documentation
(which is intended to be reviewable without a Web server) is
created within the documentation which incorporates the task
documentation, and the URL hierarchy within the static
documentation just happens to preserve this relative structure.
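The "four directories up" arithmetic can be checked mechanically. The following is a minimal Python sketch, not part of MAT, that derives the relative link from the two URL paths given above; the task directory name my_task, the page name overview.html, and the helper name relative_css_href are hypothetical.

```python
import posixpath

def relative_css_href(page_url_path, css_url_path="/MAT/doc/css/doc.css"):
    """Return the href for doc.css relative to the directory of the given page."""
    page_dir = posixpath.dirname(page_url_path)
    return posixpath.relpath(css_url_path, start=page_dir)

# Hypothetical task directory and page name, placed in the conventional "doc" subdirectory.
href = relative_css_href("/MAT/doc/html/tasks/my_task/doc/overview.html")
print(href)
```

A page placed deeper or shallower in the task directory would need a different number of ".." segments, which the same helper would compute.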
Once you create your documentation, you can have it placed in one
of two positions in the documentation sidebar, using the
<doc_enhancement> element. All the documentation paths you
provide with this mechanism should be relative to the task
directory root.
One of these positions is filled using the <app_header> element:
<doc_enhancement>
<app_header id="task_foo_app_customizations"
text="This version of MAT is customized for the Foo task"
brand_url="doc/foo_overview.html">
<doc_entry text="Overview" url="doc/foo_overview.html"/>
</app_header>
</doc_enhancement>
The "customization" slot is near the top of the sidebar, above "For users". This is the default place to put documentation about your task. You can insert material into this slot using the <customization_header> element. Each <customization_header> entry is its own header in the sidebar.
<doc_enhancement>
<customization_header id="task_foo_customizations"
text="The Foo task"
brand_url="doc/foo_overview.html">
<doc_entry text="Overview" url="doc/foo_overview.html"/>
</customization_header>
</doc_enhancement>
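Because every path in <doc_enhancement> is relative to the task directory root, each entry should resolve to a URL under /MAT/doc/html/tasks/<task_dir_name>/. The Python sketch below illustrates that mapping with the standard library only. The task directory name foo_task and the helper name entry_urls are hypothetical, and this is an illustration of the URL structure described above, not the way MAT itself resolves the paths.

```python
import posixpath
import xml.etree.ElementTree as ET

# Fragment modeled on the customization_header example above.
FRAGMENT = """
<doc_enhancement>
  <customization_header id="task_foo_customizations"
                        text="The Foo task"
                        brand_url="doc/foo_overview.html">
    <doc_entry text="Overview" url="doc/foo_overview.html"/>
  </customization_header>
</doc_enhancement>
"""

def entry_urls(xml_text, task_dir_name):
    """Return (text, served URL path) pairs for every doc_entry in the fragment."""
    base = posixpath.join("/MAT/doc/html/tasks", task_dir_name)
    root = ET.fromstring(xml_text.strip())
    return [(entry.get("text"), posixpath.join(base, entry.get("url")))
            for entry in root.iter("doc_entry")]

# "foo_task" is a hypothetical task directory name.
urls = entry_urls(FRAGMENT, "foo_task")
print(urls)
```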
In addition to these basic capabilities, a <doc_section>
element is available to create indented subsections within your
sections, and an <external_id_header> element is available
to add additional entries to a toplevel section defined by a
parent task.
If you want to annotate, say, Hebrew, you can define it as a
right-to-left language as follows:
<languages>
...
<language name="Hebrew" code="he" text_right_to_left="yes"/>
...
</languages>
If you want Java to be invoked with, say, 4GB of heap by default
(Java is used by the jCarafe tokenizer, trainer, and tagger), you
can set this globally in your task:
<engines>
<java_subprocess_parameters heap_size="4G"/>
...
</engines>
MAT knows that an engine is trainable because you define at least
one model configuration for it. But you can define multiple
configurations, if, e.g., you want to use different training
strategies, or you want to have configurations which bear
different numbers of model-building iterations. (All these options
can be configured on the command line, so you don't strictly need
multiple configurations; but you might find them convenient.)
Here, the default configuration (the unnamed one) uses the PSA
training method with 6 iterations; the alternative configuration
uses the standard training method with the standard number of
iterations:
<engine name='carafe_tag_engine'>
...
<model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
<build_settings training_method='psa' max_iterations='6'/>
</model_config>
<model_config config_name='alt_model_build'
class='MAT.JavaCarafe.CarafeModelBuilder'/>
...
</engine>
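To see which configurations an engine declares, the fragment can be read with any XML parser. The following Python sketch assumes the element and attribute names shown above and omits the elided parts of the engine; the helper name list_model_configs and the "(default)" key used for the unnamed configuration are hypothetical conveniences, not MAT conventions.

```python
import xml.etree.ElementTree as ET

# The engine example above, without the elided ("...") parts.
ENGINE = """
<engine name='carafe_tag_engine'>
  <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
    <build_settings training_method='psa' max_iterations='6'/>
  </model_config>
  <model_config config_name='alt_model_build'
                class='MAT.JavaCarafe.CarafeModelBuilder'/>
</engine>
"""

def list_model_configs(xml_text):
    """Map each configuration name to its explicit build settings."""
    engine = ET.fromstring(xml_text.strip())
    configs = {}
    for model_config in engine.findall("model_config"):
        # The unnamed configuration is the default one.
        name = model_config.get("config_name", "(default)")
        settings = model_config.find("build_settings")
        # No build_settings child means the builder's standard settings apply.
        configs[name] = dict(settings.attrib) if settings is not None else {}
    return configs

configs = list_model_configs(ENGINE)
print(configs)
```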
MAT 3.0 provides the option of multiple mixed-initiative steps,
as we've seen in the Sample Relations task. Such a task starts
with multiple annotation sets.
New:
<annotations inherit='category:zone,category:token'>
...
<span label="LABEL1" of_set="set1" of_category="content"/>
<span label="LABEL2" of_set="set2" of_category="content"/>
...
</annotations>
Legacy:
<annotation_set_descriptors inherit='category:zone,category:token'>
<annotation_set_descriptor name="set1" category="content">
...
<annotation label="LABEL1"/>
...
</annotation_set_descriptor>
<annotation_set_descriptor name="set2" category="content">
...
<annotation label="LABEL2"/>
...
</annotation_set_descriptor>
</annotation_set_descriptors>
The key to defining multiple possibilities for mixed initiative
is in the steps:
<steps>
<annotation_step name="set1_step" type="mixed"
engine="..." sets_added="set1"/>
<annotation_step name="set2_step" type="mixed"
engine="..." sets_added="set2"/>
<annotation_step name="correction_step" type="hand"
sets_modified="set1,set2"/>
</steps>
The idea here is that there are separate engines devoted to
adding the annotations in set1 and set2. An annotator might want
to do pretagging and correction for each set individually:
<workflows>
<workflow name="Mixed Initiative">
<step name="set1_step" pretty_name="Do set1"/>
<step name="set2_step" pretty_name="Do set2"/>
</workflow>
...
</workflows>
This workflow requires that all annotations in set1 are completed
before set2 is begun. But even if this expresses the annotator's
normal workflow, there may be times when the annotator finds an
error in the set1 annotations while she's annotating set2. That's
where the final step comes in:
<workflows>
...
<workflow name="Correction">
<step name="correction_step" pretty_name="Correct"/>
</workflow>
</workflows>
During hand annotation, the annotator can simply switch to the
Correction workflow. The single step in this workflow is defined
as only modifying sets, which is interpreted by the UI as
permitting the annotator to modify sets which have been otherwise
completed. Once corrections in previous steps have been done, the
annotator can simply return to the Mixed Initiative workflow and
continue.
Note: If you're correcting in the midst of other
annotation, as this example suggests, don't complete the
correction step in the correction workflow. This will mark all
your annotation sets gold, which isn't what you want in this case,
since you'll be returning to annotation for set2.
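Since workflows refer to steps by name, a mistyped name is an easy error to make. The Python sketch below is a hypothetical consistency check, not a MAT tool: it collects the declared annotation_step names and reports any workflow step that does not match one. The <fragment> wrapper element and the helper name undeclared_steps are assumptions made so that the example parses on its own; the engine values are left elided as in the examples above.

```python
import xml.etree.ElementTree as ET

# Steps and workflows from the examples above, wrapped in a hypothetical root element.
TASK_FRAGMENT = """
<fragment>
  <steps>
    <annotation_step name="set1_step" type="mixed" engine="..." sets_added="set1"/>
    <annotation_step name="set2_step" type="mixed" engine="..." sets_added="set2"/>
    <annotation_step name="correction_step" type="hand" sets_modified="set1,set2"/>
  </steps>
  <workflows>
    <workflow name="Mixed Initiative">
      <step name="set1_step" pretty_name="Do set1"/>
      <step name="set2_step" pretty_name="Do set2"/>
    </workflow>
    <workflow name="Correction">
      <step name="correction_step" pretty_name="Correct"/>
    </workflow>
  </workflows>
</fragment>
"""

def undeclared_steps(xml_text):
    """Return (workflow name, step name) pairs whose step is not declared."""
    root = ET.fromstring(xml_text.strip())
    declared = {step.get("name") for step in root.iter("annotation_step")}
    problems = []
    for workflow in root.iter("workflow"):
        for step in workflow.findall("step"):
            if step.get("name") not in declared:
                problems.append((workflow.get("name"), step.get("name")))
    return problems

problems = undeclared_steps(TASK_FRAGMENT)
print(problems)
```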
One application of this sort of complex workflow might involve
separating span tagging from attribute filling. E.g., you might
have annotation sets like the following.
New:
<annotations inherit='category:zone,category:token'>
<span label="PERSON" of_set="spans">
<string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
</span>
<span label="ORGANIZATION" of_set="spans">
<string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
</span>
<span label="LOCATION" of_set="spans">
<string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
</span>
</annotations>
Legacy:
<annotation_set_descriptors inherit='category:zone,category:token'>
<annotation_set_descriptor name="spans" category="content">
<annotation label="PERSON"/>
<annotation label="ORGANIZATION"/>
<annotation label="LOCATION"/>
</annotation_set_descriptor>
<annotation_set_descriptor name="attrs" category="content">
<attribute of_annotation="PERSON,LOCATION,ORGANIZATION" name="nomtype">
<choice>NAM</choice>
<choice>NOM</choice>
<choice>PRO</choice>
</attribute>
</annotation_set_descriptor>
</annotation_set_descriptors>
The "attrs" set descriptor, while still in the content
category, defines an attribute on the annotations which are
defined in the "spans" set descriptor. Given the following
steps and workflow:
<steps>
<annotation_step type="hand" name="add spans" sets_added="spans"/>
<annotation_step type="hand" name="add attrs" sets_added="attrs"/>
</steps>
<workflows>
<workflow name="Annotation">
<step name="add spans"/>
<step name="add attrs"/>
</workflow>
</workflows>
in the first step you'll be able only to add the spans, not
edit the "nomtype" attribute; and in the second step you'll be
able only to edit the "nomtype" attribute of existing spans, not
create, modify, or delete the spans.
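One way to see which step is responsible for which piece of the annotation is to group the labels and attributes by their of_set value. The Python sketch below does this for the new-style <annotations> example above; the helper name members_by_set and the "LABEL.attribute" naming of attributes are hypothetical, and the grouping is only an illustration of the set assignments, not of how MAT enforces them.

```python
import xml.etree.ElementTree as ET

# The new-style annotation declarations from the example above.
ANNOTATIONS = """
<annotations inherit='category:zone,category:token'>
  <span label="PERSON" of_set="spans">
    <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
  </span>
  <span label="ORGANIZATION" of_set="spans">
    <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
  </span>
  <span label="LOCATION" of_set="spans">
    <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
  </span>
</annotations>
"""

def members_by_set(xml_text):
    """Group span labels and their attributes by the set they belong to."""
    root = ET.fromstring(xml_text.strip())
    by_set = {}
    for span in root.findall("span"):
        by_set.setdefault(span.get("of_set"), []).append(span.get("label"))
        # Each child element of a span declares one attribute of that span.
        for attr in span:
            qualified = span.get("label") + "." + attr.get("name")
            by_set.setdefault(attr.get("of_set"), []).append(qualified)
    return by_set

grouped = members_by_set(ANNOTATIONS)
print(grouped)
```

Here the "spans" set holds the three labels and the "attrs" set holds their nomtype attributes, matching the split between the "add spans" and "add attrs" steps.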
This partition of activities interacts with attribute defaults in
a perhaps unexpected way. If the attribute has a default, it will
be assigned when the annotation is created, even if the current
step doesn't support adding or editing the attribute. Accordingly,
if the workflow is undoable, any attribute which has a default will
be restored to the default value when the step that adds the
attribute is undone.