Task XML use cases

This document describes use cases for the XML format of the task files (see "Creating a New Task"). The format itself is described in full in the reference document.

Although the annotation types are defined in the task.xml file, we provide examples for them separately. Here, we focus on the other elements of the task.xml file.

Customizing the overall UI

Changing the order of annotation labels

By default, the labels available to the annotator are listed in the legend by annotation set, and then alphabetized within each set. Similarly, in the annotation editor popups, the labels are alphabetized by default. If you want the labels to appear in the order they were defined in the task.xml file, you can do this:

  <web_customization alphabetize_labels="no"/>

Changing the name of the UI

If you want to "brand" your annotation task, you can change the title of the Web page and the "slug" that appears at the leftmost edge of the MAT desktop toolbar. The branding takes effect only if all your installed tasks have the same branding information; if not, MAT just uses its default branding. You can change the branding like this:

  <web_customization>
    <short_name>MCAT</short_name>
    <long_name>MCAT: My Company's Annotation Tool</long_name>
  </web_customization>

Adding documentation for your task

You might want to write documentation for your task, and have it available in the MAT documentation accessible within the MAT UI. This functionality has been present in MAT for a very long time, but it has been exposed in a way that users can take advantage of only since MAT 3.3.

For reasons that will be clear in a moment, it's important to understand the URL structure of the documentation. The MAT documentation is available at the path /MAT/doc/html/. Every task documentation page is available via the path /MAT/doc/html/tasks/<task_dir_name>/<path_within_task_dir>. So if you place a documentation page within the doc subdirectory, it'll be at /MAT/doc/html/tasks/<task_dir_name>/doc/<page>.html. Keep this in mind.

Your first step is to write your documentation, in HTML, and place it in your task directory. You can put it anywhere, but the conventional place to put it is in a subdirectory named "doc". In order to maintain look-and-feel consistency with the rest of the documentation, the MAT doc.css file should be included. This file's path, relative to the server root, is /MAT/doc/css/doc.css, so, keeping in mind the URL structure above, if your page is within the doc subdirectory in your task, you'll add the following CSS link within your <head>:

<link href="../../../../css/doc.css" rel="stylesheet" type="text/css">

In other words, that's four directories up. A relative path is preferable to an absolute path starting from the server root. If you build a MAT distribution which includes your task, the build creates a static subset of the documentation, intended to be reviewable without a Web server, which incorporates the task documentation. The URL hierarchy within the static documentation preserves this relative structure, so the relative link keeps working there.
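
The count of four directories can be double-checked with a few lines of Python. This is only a sketch: the task directory name "foo" and the page name "foo_overview.html" are placeholders standing in for <task_dir_name> and <page>.html, and the helper relative_href is our own, not part of MAT.

```python
import posixpath

# URL paths taken from the text; "foo" is a hypothetical task directory name.
CSS_URL = "/MAT/doc/css/doc.css"
PAGE_URL = "/MAT/doc/html/tasks/foo/doc/foo_overview.html"


def relative_href(target_url, page_url):
    """Return target_url expressed relative to the directory holding page_url."""
    page_dir = posixpath.dirname(page_url)
    return posixpath.relpath(target_url, start=page_dir)


if __name__ == "__main__":
    # Prints ../../../../css/doc.css
    print(relative_href(CSS_URL, PAGE_URL))
```

The result matches the href used in the <link> element above. If you place your pages somewhere other than the doc subdirectory, the same computation gives you the right number of "../" segments.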

Once you create your documentation, you can have it placed in one of two positions in the documentation sidebar, using the <doc_enhancement> element. All the documentation paths you provide with this mechanism should be relative to the task directory root.

The "application" slot is at the top of the sidebar, above "Getting started". You can insert material into this slot using the <app_header> element. You might use this slot if you're using MAT for a single purpose that you want the documentation to lead with. You also have the option of "branding" the documentation, changing the documentation window title and the initial page, if you additionally provide the brand_url attribute, whose value should be a URL which is also available via a link within the documentation section.

Each <app_header> entry is its own header in the sidebar, so it will seldom make sense for you to provide more than one of these, or install multiple tasks with this entry.

  <doc_enhancement>
    <app_header id="task_foo_app_customizations"
                text="This version of MAT is customized for the Foo task"
                brand_url="doc/foo_overview.html">
      <doc_entry text="Overview" url="doc/foo_overview.html"/>
    </app_header>
  </doc_enhancement>

The "customization" slot is near the top of the sidebar, above "For users". This is the default place to put documentation about your task. You can insert material into this slot using the <customization_header> element. Each <customization_header> entry is its own header in the sidebar.

  <doc_enhancement>
    <customization_header id="task_foo_customizations"
                          text="The Foo task"
                          brand_url="doc/foo_overview.html">
      <doc_entry text="Overview" url="doc/foo_overview.html"/>
    </customization_header>
  </doc_enhancement>
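
Two of the constraints described above can be checked mechanically: documentation paths should be relative to the task directory root, and a brand_url should also be reachable via a link in the same section. The following is a minimal sketch using only the Python standard library. The function check_doc_enhancement is our own illustration, not something MAT provides, and it looks only at the attributes shown in the examples.

```python
import xml.etree.ElementTree as ET

# The <app_header> example from the text.
DOC_ENHANCEMENT = """
<doc_enhancement>
  <app_header id="task_foo_app_customizations"
              text="This version of MAT is customized for the Foo task"
              brand_url="doc/foo_overview.html">
    <doc_entry text="Overview" url="doc/foo_overview.html"/>
  </app_header>
</doc_enhancement>
"""


def check_doc_enhancement(xml_text):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if tags are mismatched
    for header in root:
        header_id = header.get("id")
        urls = [entry.get("url", "") for entry in header.iter("doc_entry")]
        for url in urls:
            # Paths should be relative to the task directory root.
            if url.startswith("/") or "://" in url:
                problems.append(f"{header_id}: '{url}' is not a relative path")
        brand = header.get("brand_url")
        # brand_url should also appear as a link within the same section.
        if brand is not None and brand not in urls:
            problems.append(f"{header_id}: brand_url '{brand}' has no matching doc_entry")
    return problems


if __name__ == "__main__":
    print(check_doc_enhancement(DOC_ENHANCEMENT))
```

Because the fragment is parsed as XML first, the same call also fails loudly if a closing tag does not match its opening tag.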

In addition to these basic capabilities, a <doc_section> element is available to create indented subsections within your sections, and an <external_id_header> element is available to add entries to a top-level section defined by a parent task.

Modifying the language information

Working with right-to-left languages

If you want to annotate, say, Hebrew, you can define it as a right-to-left language as follows:

  <languages>
    ...
    <language name="Hebrew" code="he" text_right_to_left="yes"/>
    ...
  </languages>

Defining engines

Changing the Java default heap and stack sizes

If you know you want Java to be called with, say, 4GB of heap by default (by the jCarafe tokenizer, trainer, and tagger), you can set this globally in your task:

  <engines>
    <java_subprocess_parameters heap_size="4G"/>
    ...
  </engines>

Defining multiple model configurations for a trainable engine

MAT knows that an engine is trainable because you define at least one model configuration for it. But you can define multiple configurations if, e.g., you want to use different training strategies, or you want configurations which use different numbers of model-building iterations. (All these options can be configured on the command line, so you don't strictly need multiple configurations; but you might find them convenient.) Here, the default configuration (the unnamed one) uses the PSA training method with 6 iterations; the alternative configuration uses the standard training method with the standard number of iterations:

  <engine name='carafe_tag_engine'>
    ...
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
      <build_settings training_method='psa' max_iterations='6'/>
    </model_config>
    <model_config config_name='alt_model_build'
                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
    ...
  </engine>
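
If you want to see at a glance which configuration is the default and what each one overrides, a short script can list them. This is a sketch using the Python standard library, not a MAT utility: summarize_model_configs is a hypothetical helper, and the elided "..." lines from the example are simply left out so that the fragment parses.

```python
import xml.etree.ElementTree as ET

# The engine example from the text, without the elided "..." lines.
ENGINE = """
<engine name='carafe_tag_engine'>
  <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
    <build_settings training_method='psa' max_iterations='6'/>
  </model_config>
  <model_config config_name='alt_model_build'
                class='MAT.JavaCarafe.CarafeModelBuilder'/>
</engine>
"""


def summarize_model_configs(xml_text):
    """Map each config name (None for the unnamed default) to its build settings."""
    engine = ET.fromstring(xml_text)
    summary = {}
    for config in engine.findall("model_config"):
        settings = config.find("build_settings")
        # A configuration without <build_settings> overrides nothing.
        summary[config.get("config_name")] = (
            dict(settings.attrib) if settings is not None else {}
        )
    return summary


if __name__ == "__main__":
    for name, settings in summarize_model_configs(ENGINE).items():
        print(name or "(default)", settings)
```

The unnamed configuration shows the PSA settings, and alt_model_build shows an empty dictionary, meaning it relies on the standard settings.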

Defining complex workflows

Multiple mixed-initiative steps with the possibility of correction

MAT 3.0 provides the option of multiple mixed-initiative steps, as we've seen in the Sample Relations task. Such a task starts with multiple annotation sets.

New:

  <annotations inherit='category:zone,category:token'>
    ...
    <span label="LABEL1" of_set="set1" of_category="content"/>
    <span label="LABEL2" of_set="set2" of_category="content"/>
    ...
  </annotations>

Legacy:

  <annotation_set_descriptors inherit='category:zone,category:token'>
    <annotation_set_descriptor name="set1" category="content">
      ...
      <annotation label="LABEL1"/>
      ...
    </annotation_set_descriptor>
    <annotation_set_descriptor name="set2" category="content">
      ...
      <annotation label="LABEL2"/>
      ...
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The key to defining multiple possibilities for mixed initiative is in the steps:

  <steps>
    <annotation_step name="set1_step" type="mixed"
                     engine="..." sets_added="set1"/>
    <annotation_step name="set2_step" type="mixed"
                     engine="..." sets_added="set2"/>
    <annotation_step name="correction_step" type="hand"
                     sets_modified="set1,set2"/>
  </steps>

The idea here is that there are separate engines devoted to adding the annotations in set1 and set2. An annotator might want to do pretagging and correction for each set individually:

  <workflows>
    <workflow name="Mixed Initiative">
      <step name="set1_step" pretty_name="Do set1"/>
      <step name="set2_step" pretty_name="Do set2"/>
    </workflow>
    ...
  </workflows>

This workflow requires that all annotations in set1 are completed before set2 is begun. But even if this expresses the annotator's normal workflow, there may be times when the annotator finds an error in the set1 annotations while she's annotating set2. That's where the final step comes in:

  <workflows>
    ...
    <workflow name="Correction">
      <step name="correction_step" pretty_name="Correct"/>
    </workflow>
  </workflows>

During hand annotation, the annotator can simply switch to the Correction workflow. The single step in this workflow is defined as only modifying sets, which is interpreted by the UI as permitting the annotator to modify sets which have been otherwise completed. Once corrections in previous steps have been done, the annotator can simply return to the Mixed Initiative workflow and continue.
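
When steps and workflows are spread across a task file, it's easy to misspell a step name or lose track of which step only modifies sets. The sketch below, in standard-library Python, resolves each workflow step against the step definitions and flags the ones that declare sets_modified but no sets_added. The <fragment> wrapper element and the function describe_workflows are our own inventions for the sake of a self-contained example; they are not part of the task.xml format or of MAT.

```python
import xml.etree.ElementTree as ET

# Steps and workflows from the text, wrapped in a hypothetical <fragment> root.
FRAGMENT = """
<fragment>
  <steps>
    <annotation_step name="set1_step" type="mixed" engine="..." sets_added="set1"/>
    <annotation_step name="set2_step" type="mixed" engine="..." sets_added="set2"/>
    <annotation_step name="correction_step" type="hand" sets_modified="set1,set2"/>
  </steps>
  <workflows>
    <workflow name="Mixed Initiative">
      <step name="set1_step" pretty_name="Do set1"/>
      <step name="set2_step" pretty_name="Do set2"/>
    </workflow>
    <workflow name="Correction">
      <step name="correction_step" pretty_name="Correct"/>
    </workflow>
  </workflows>
</fragment>
"""


def split_csv(value):
    """Split a comma-separated attribute value; a missing attribute gives []."""
    return [item for item in (value or "").split(",") if item]


def describe_workflows(xml_text):
    """Map each workflow name to a list of resolved step descriptions."""
    root = ET.fromstring(xml_text)
    definitions = {s.get("name"): s for s in root.iter("annotation_step")}
    report = {}
    for workflow in root.iter("workflow"):
        entries = []
        for step in workflow.findall("step"):
            name = step.get("name")
            if name not in definitions:
                raise ValueError(f"workflow step {name!r} is not defined in <steps>")
            added = split_csv(definitions[name].get("sets_added"))
            modified = split_csv(definitions[name].get("sets_modified"))
            entries.append({
                "step": name,
                "adds": added,
                "modifies": modified,
                # Only modifies sets: the kind of step used for correction.
                "modify_only": bool(modified) and not added,
            })
        report[workflow.get("name")] = entries
    return report


if __name__ == "__main__":
    for workflow_name, entries in describe_workflows(FRAGMENT).items():
        print(workflow_name, entries)
```

For the example above, the only modify-only step is correction_step in the Correction workflow, which is the step the annotator switches to in order to fix completed sets.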

Note: If you're correcting in the midst of other annotation, as this example suggests, don't complete the correction step in the correction workflow. This will mark all your annotation sets gold, which isn't what you want in this case, since you'll be returning to annotation for set2.

Defining a workflow step which adds attributes only

One application of this sort of complex workflow might involve separating span tagging from attribute filling. E.g., you might have annotation sets like the following.

New:

  <annotations inherit='category:zone,category:token'>
    <span label="PERSON" of_set="spans">
      <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
    </span>
    <span label="ORGANIZATION" of_set="spans">
      <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
    </span>
    <span label="LOCATION" of_set="spans">
      <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
    </span>
  </annotations>

Legacy:

  <annotation_set_descriptors inherit='category:zone,category:token'>
    <annotation_set_descriptor name="spans" category="content">
      <annotation label="PERSON"/>
      <annotation label="ORGANIZATION"/>
      <annotation label="LOCATION"/>
    </annotation_set_descriptor>
    <annotation_set_descriptor name="attrs" category="content">
      <attribute of_annotation="PERSON,LOCATION,ORGANIZATION" name="nomtype">
        <choice>NAM</choice>
        <choice>NOM</choice>
        <choice>PRO</choice>
      </attribute>
    </annotation_set_descriptor>
  </annotation_set_descriptors>

The "attrs" set descriptor, while still in the content category, defines an attribute on the annotations which are defined in the "spans" set descriptor. So given the following steps and workflow:

  <steps>
    <annotation_step type="hand" name="add spans" sets_added="spans"/>
    <annotation_step type="hand" name="add attrs" sets_added="attrs"/>
  </steps>
  <workflows>
    <workflow name="Annotation">
      <step name="add spans"/>
      <step name="add attrs"/>
    </workflow>
  </workflows>

in the first step, you'll be able to add the spans but not edit the "nomtype" attribute; and in the second step, you'll be able to edit the "nomtype" attribute of existing spans, but not create, modify or delete the spans.
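
To see how the spans and attributes are partitioned between the two sets, you can group the declarations in the new-style <annotations> block by their of_set value. The following is a rough sketch in standard-library Python. The function declarations_by_set is hypothetical, and it assumes that every attribute carries its own of_set, as in the example above; it does not model how MAT itself enforces the restriction.

```python
import xml.etree.ElementTree as ET

# The new-style annotation declarations from the text.
ANNOTATIONS = """
<annotations inherit='category:zone,category:token'>
  <span label="PERSON" of_set="spans">
    <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
  </span>
  <span label="ORGANIZATION" of_set="spans">
    <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
  </span>
  <span label="LOCATION" of_set="spans">
    <string name="nomtype" choices="NAM,NOM,PRO" of_set="attrs"/>
  </span>
</annotations>
"""


def declarations_by_set(xml_text):
    """Group span labels and attribute names by the set that declares them."""
    root = ET.fromstring(xml_text)
    by_set = {}

    def bucket(set_name):
        return by_set.setdefault(set_name, {"spans": [], "attributes": []})

    for span in root.findall("span"):
        label = span.get("label")
        bucket(span.get("of_set"))["spans"].append(label)
        # Assumption: each child attribute names its own set via of_set.
        for attribute in span:
            bucket(attribute.get("of_set"))["attributes"].append(
                f"{label}.{attribute.get('name')}"
            )
    return by_set


if __name__ == "__main__":
    for set_name, contents in declarations_by_set(ANNOTATIONS).items():
        print(set_name, contents)
```

The "spans" set ends up holding only the three span labels and the "attrs" set only the three nomtype attributes, which mirrors what the "add spans" and "add attrs" steps each let you touch.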

This partition of activities interacts with attribute defaults in a perhaps unexpected way. If the attribute has a default, it will be assigned when the annotation is created, even if the current step doesn't support adding or editing the attribute. Accordingly, if the workflow is undoable, any attribute which has a default will be restored to the default value when the step that adds the attribute is undone.