Creating a new task

If you haven't received an already-customized version of MAT, and you want to do something besides the default named entity task, you're going to want to define your own task. This document describes how to do that for simple tasks.

Step 1: Set up your directory
Step 2: Create a task.xml file
Step 3: Name your task and set up your template
Step 4: Declare your annotations
Step 5: Declare your engines

Zoner
Tokenizer/sentence tagger
CRF trainer/tagger
Maximum-entropy trainer/tagger

Step 6: Define your steps and workflows
Step 7: Define your workspace configurations
Step 8: Install the task
Further reading

Step 1: Set up your directory

Create a directory. This directory might ultimately have various subdirectories; for instance, custom Python code must live in files in a python/ subdirectory, and custom Javascript code should live in files in a js/ subdirectory. But you don't need to know about those right now.

Step 2: Create a task.xml file

Most of what we're going to talk about in this document is the task.xml file. You can get a better idea of what this file consists of by looking at the task documentation, the documentation on the sample tasks, and the documentation on the task XML and annotation set descriptor XML itself.

For now, just open an empty file named task.xml and save the empty file in your directory created in step 1.

Step 3: Name your task and set up your template

Create a top-level <task> element, and give your task a name:

<task name="Widget Annotation">
  <languages>
    <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
  </languages>
  <annotations>
  </annotations>
</task>

The <languages> element is obligatory; each task needs to declare which languages it intends to apply to. The entry for English provided here can be reused in your various tasks.
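Before going further, you may want to confirm that the skeleton is well-formed XML and declares at least one language. Here's a minimal sketch which uses only Python's standard xml.etree.ElementTree module. It is not part of MAT, the helper name check_skeleton is hypothetical, and it checks only what this step describes, not MAT's own validation rules.

```python
import xml.etree.ElementTree as ET

# The skeleton from this step, inlined so the example is self-contained.
TASK_XML = """
<task name="Widget Annotation">
  <languages>
    <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
  </languages>
  <annotations>
  </annotations>
</task>
"""

def check_skeleton(xml_text):
    """Return (task name, language codes); raise ValueError on problems.

    Hypothetical helper, not a MAT API.
    """
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    if root.tag != "task" or not root.get("name"):
        raise ValueError("top-level element must be a named <task>")
    codes = [lang.get("code") for lang in root.findall("languages/language")]
    if not codes:
        raise ValueError("<languages> must declare at least one <language>")
    return root.get("name"), codes

name, codes = check_skeleton(TASK_XML)
print(name, codes)
```

In practice you'd read the text from your task.xml file rather than from an inline string.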

Your annotation labels and attributes and their display properties will be listed in the <annotations> element, so we've added one of those here. (There's also a more complex, more verbose legacy mechanism which arranges this information a bit differently. We're not going to exemplify it here; you can see the sample tasks for examples, and the use cases for the motivation behind it.)

You can specify a lot of complexity here. MAT enables you to subdivide your labels into a number of different annotation label sets, which can be associated with various steps in your workflow, and you can group these sets into categories. Both of these levels can be useful in specifying which descriptors you want to inherit, or in specifying the sets added or modified in your steps (see below), or for specifying inclusions or exclusions to MATScore or MATReport. In the examples below, we're going to ignore all of the complexities of these different sets and categories. Your annotations will be placed in the default category ("content") and the default set (also "content"). (For more complex examples, see tasks such as the "Sample Relations" task, where you have multiple trainable engines, and see here for information about more complex tasks.)

Step 4: Declare your annotations

Let's assume that you're going to use the default MAT automated tools (tagger, trainer, tokenizer). First, you'll want to inherit the zone and token annotations from the core task. So far, your annotations section looks like this:

...
  <annotations inherit="category:zone,category:token">
  </annotations>
...

These two categories are defined by MAT, for your use.

Now, let's populate your content annotations.

Let's start by defining a simple task, in fact, the MUC named entity task. This task has three spanned annotations: PERSON, LOCATION and ORGANIZATION. Let's define them:

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON'/>
    <span label='LOCATION'/>
    <span label='ORGANIZATION'/>
  </annotations>
...

Your task now "knows" about these labels. So for example, if you refer to "set:content" or "category:content" when using MATScore with this task, MATScore will know you're referring to these three labels (because they're in the default set and category).

But if you want to see them in the MAT UI, or add them by hand, you'll have to do something else: you'll have to give them display properties. These properties are attributes of the <span> element, and they all start with "d_":

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L'/>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
  </annotations>
...

Now these annotations will be styled with the colors shown, and when you swipe in the MAT UI to add an annotation, you'll be able to use the keyboard accelerators instead of the mouse (e.g., when the popup menu is visible, pressing 'P' will be the same as selecting the "PERSON" item).
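As you add more labels, it's easy to assign the same accelerator key twice by accident. The sketch below looks for such clashes using Python's standard xml.etree.ElementTree module. It is not part of MAT, the helper name accelerator_clashes is hypothetical, and we're assuming here that you want each key to be unique; this document doesn't say how MAT itself handles duplicates.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The <annotations> element from above, inlined for a self-contained example.
ANNOTATIONS_XML = """
<annotations inherit="category:zone,category:token">
  <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
  <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L'/>
  <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
"""

def accelerator_clashes(xml_text):
    """Map each d_accelerator key used by more than one label to those labels.

    Hypothetical helper, not a MAT API.
    """
    by_key = defaultdict(list)
    for elem in ET.fromstring(xml_text):
        key = elem.get("d_accelerator")
        if key is not None:
            by_key[key].append(elem.get("label"))
    return {key: labels for key, labels in by_key.items() if len(labels) > 1}

clashes = accelerator_clashes(ANNOTATIONS_XML)
print(clashes or "no clashes")
```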

Let's say that you want to add some attributes to the LOCATION annotation: a "nomtype" choice attribute which indicates whether the element is a proper name, a nominal, or a pronoun; a boolean "is_gpe" to indicate whether the location is a geopolitical entity, and a "comment" field to allow the annotator to add comments. Let's do those now:

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L'>
      <string name='nomtype' choices='proper name,nominal,pronoun'/>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
  </annotations>
...

Attributes are defined in the scope of their annotation label, so we had to "open up" the <span> element that declares the LOCATION annotation. Notice that the types of the attributes are encoded in the element names. We've given the "nomtype" attribute three fixed choices, which are comma-separated; you can also use spaces as separators if there are no commas in your choices.

Now, in the MAT UI, if you click on an existing LOCATION annotation, you'll be presented with an option to edit that annotation in an annotation editor.

You can also add annotations with integer or float values (and optionally assign range limitations to them). For complete details, see here.

Let's refine the annotation display further. Let's say that you want to bring up the annotation editor immediately when you create the LOCATION annotation; we'll add the d_edit_immediately='yes' setting to make that happen. And let's say that you want keyboard accelerators for the values of the "nomtype" attribute, for those places in the UI where menus for this attribute are presented.

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
      <string name='nomtype'>
        <choice accelerator='P' value='proper name'/>
        <choice accelerator='N' value='nominal'/>
        <choice accelerator='O' value='pronoun'/>
      </string>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
  </annotations>
...

Notice that in order to do this, we had to make the declaration of the "nomtype" choices a bit more verbose. Instead of using an attribute on <string>, we've "opened up" the <string> element and inserted <choice> element children to host the value and accelerator. This more verbose alternative is also useful when your choices contain both commas and spaces, and can't be separated reliably using the conventions for the "choices" XML attribute.
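To make the two notations concrete, here's a rough Python sketch which reads the choices from either form and shows that they yield the same values. It uses only the standard xml.etree.ElementTree module and is not MAT's own parser: the helper names are hypothetical, and stripping whitespace around comma-separated values is our assumption.

```python
import xml.etree.ElementTree as ET

def split_choices(value):
    """Split on commas if any are present, otherwise on whitespace.

    Assumption: surrounding whitespace is not significant.
    """
    if "," in value:
        return [choice.strip() for choice in value.split(",")]
    return value.split()

def declared_choices(attr_elem):
    """Return the choice values of an attribute element, in either notation."""
    children = attr_elem.findall("choice")
    if children:
        return [child.get("value") for child in children]
    return split_choices(attr_elem.get("choices", ""))

compact = ET.fromstring(
    "<string name='nomtype' choices='proper name,nominal,pronoun'/>")
verbose = ET.fromstring("""
<string name='nomtype'>
  <choice accelerator='P' value='proper name'/>
  <choice accelerator='N' value='nominal'/>
  <choice accelerator='O' value='pronoun'/>
</string>
""")

print(declared_choices(compact))
print(declared_choices(verbose))
```

Notice that "proper name" survives in the compact form only because the commas take precedence over the space; a choice which itself contained a comma would need the verbose form.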

Now let's add another attribute: an attribute of PERSON named "roles", which can be a set of roles drawn from a set of possible choices. This shows that in addition to single values, you can have attributes whose values are sets (in this case, sets of strings) or lists. And let's give this attribute a default value:

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
      <string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
    </span>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
      <string name='nomtype'>
        <choice accelerator='P' value='proper name'/>
        <choice accelerator='N' value='nominal'/>
        <choice accelerator='O' value='pronoun'/>
      </string>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
  </annotations>
...

Set-valued attributes will be rendered as a multiple-choice menu in the MAT UI. Notice that the different attribute aggregation possibilities (sets and lists) are encoded, like the attribute types, as part of the element name.

Let's complete our set of span annotations with a special label for hand annotation: WUZZA. Let's say we want to use this annotation when we suspect, as human annotators, that the span ought to be annotated, but we're not sure how. These annotations aren't intended to be processed, so we want the automated steps (e.g., the training engine and the scorer) to ignore them. So we mark the label with processable='no', and make sure to give it a style:

... 
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
      <string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
    </span>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
      <string name='nomtype'>
        <choice accelerator='P' value='proper name'/>
        <choice accelerator='N' value='nominal'/>
        <choice accelerator='O' value='pronoun'/>
      </string>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
    <span label='WUZZA' processable='no' d_css='background-color: gray'/>
  </annotations>
...

(We could also "protect" this label by setting up multiple annotation set descriptors, but as we said, we're ignoring that option in this example.)

Now that we've got a good handle on the basic annotation options, let's move on to annotations that point to other annotations, and spanless annotations.

Let's say that now we want to add a label that we intend to use with verbs like "work for" or "employ". Let's call it EMPLOYS. And let's set it up so that you can link it to an employer (an ORGANIZATION) and an employee (a PERSON). And just to amuse ourselves, when we style it, we'll change its text foreground color, so that it's white on a dark background:

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
      <string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
    </span>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
      <string name='nomtype'>
        <choice accelerator='P' value='proper name'/>
        <choice accelerator='N' value='nominal'/>
        <choice accelerator='O' value='pronoun'/>
      </string>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
    <span label='WUZZA' processable='no' d_css='background-color: gray'/>
    <span label='EMPLOYS' d_css='color: white; background-color: purple'>
      <filler name='employer' filler_types='ORGANIZATION'/>
      <filler name='employee' filler_types='PERSON'/>
    </span>
  </annotations>
...

When you add this annotation in the MAT UI, it'll look like any other span annotation; but you'll also have a number of options for linking it to other annotations via its annotation-typed attributes (introduced by the <filler> element here).

You can have multiple possibilities for label fillers. So, for instance, you could specify that employers can be either organizations or people:

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
      <string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
    </span>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
      <string name='nomtype'>
        <choice accelerator='P' value='proper name'/>
        <choice accelerator='N' value='nominal'/>
        <choice accelerator='O' value='pronoun'/>
      </string>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
    <span label='WUZZA' processable='no' d_css='background-color: gray'/>
    <span label='EMPLOYS' d_css='color: white; background-color: purple'>
      <filler name='employer' filler_types='ORGANIZATION PERSON'/>
      <filler name='employee' filler_types='PERSON'/>
    </span>
  </annotations>
...

Like the choices, the filler types can be either comma- or space-separated.
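A typo in a filler type is easy to miss, so here's a small sketch which checks that every label named in a filler_types attribute is declared in the same <annotations> element. It uses only Python's standard xml.etree.ElementTree module and is not part of MAT. The helper name undeclared_filler_types is hypothetical, and the sketch assumes that all of your filler labels are declared locally rather than inherited.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the example above (display properties omitted).
ANNOTATIONS_XML = """
<annotations inherit="category:zone,category:token">
  <span label='PERSON'/>
  <span label='ORGANIZATION'/>
  <span label='EMPLOYS'>
    <filler name='employer' filler_types='ORGANIZATION PERSON'/>
    <filler name='employee' filler_types='PERSON'/>
  </span>
</annotations>
"""

def undeclared_filler_types(xml_text):
    """Return (label, filler name, type) triples whose type isn't declared.

    Hypothetical helper, not a MAT API.
    """
    root = ET.fromstring(xml_text)
    declared = {e.get("label") for e in root if e.tag in ("span", "spanless")}
    problems = []
    for elem in root:
        for filler in elem.findall("filler"):
            raw = filler.get("filler_types", "")
            # Comma-separated if any comma is present, else space-separated.
            types = raw.split(",") if "," in raw else raw.split()
            for filler_type in (t.strip() for t in types):
                if filler_type and filler_type not in declared:
                    problems.append(
                        (elem.get("label"), filler.get("name"), filler_type))
    return problems

print(undeclared_filler_types(ANNOTATIONS_XML))
```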

Finally, annotations don't need to be anchored to a particular span. In the MAT UI, we create, edit and view these annotations in the spanless sidebar. These annotations are defined and styled just like any other annotation, except they're declared with the <spanless> element rather than the <span> element:

...
  <annotations inherit="category:zone,category:token">
    <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
      <string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
    </span>
    <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
      <string name='nomtype'>
        <choice accelerator='P' value='proper name'/>
        <choice accelerator='N' value='nominal'/>
        <choice accelerator='O' value='pronoun'/>
      </string>
      <boolean name='is_gpe'/>
      <string name='comment'/>
    </span>
    <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
    <span label='WUZZA' processable='no' d_css='background-color: gray'/>
    <span label='EMPLOYS' d_css='color: white; background-color: purple'>
      <filler name='employer' filler_types='ORGANIZATION PERSON'/>
      <filler name='employee' filler_types='PERSON'/>
    </span>
    <spanless label='PARENT_OF' d_css='background-color: pink'>
      <filler name='parent' filler_types='PERSON'/>
      <filler name='child' filler_types='PERSON'/>
    </spanless>
  </annotations>
...

(Well, not exactly like other annotations; spanless annotations have d_edit_immediately='yes' set by default, and it doesn't make any sense to style their foreground colors, since they don't span any text.)

There's lots more you can do to define annotations and their displays; see the annotation set descriptor XML use cases and the task XML use cases for examples.

Step 5: Declare your engines

Now, if you're going to do any automated processing, it's time to define your engines. Details of the engines discussed below can be found here.

An engine is an automated tagger. It may be trainable, or not. The core task doesn't define any engines for you which you can inherit. In this section, we describe the engines you might want. You'll see that each of these engines has a <step_config>, which specifies the Python class which implements the tagger.

Zoner

Your first engine might declare what sections of the document are annotatable. MAT has a special category of annotations called zone, which controls this. As described above, you can inherit category:zone from the core task to access the default zone annotation. MAT provides a simple engine which marks the entire document annotatable. This engine is now mainly of historical interest; from MAT 3.0 forward, unzoned documents are treated as if they've been processed with this engine.

  <engines>
    ...
    <!-- simple zoner which marks the entire document annotatable -->
    <engine name='whole_zone_engine'>
      <step_config class='MAT.PluginMgr.WholeZoneStep'/>
    </engine>
    ...
  </engines>

This engine is not trainable.

Tokenizer/sentence tagger

Your next engine might tokenize the annotatable regions, and/or segment these regions into sentences. jCarafe, the default engine, includes a (non-trainable) tokenizer which provides reasonable tokenization, and simple sentence segmentation, for English. MAT provides a special annotation category called token which contains the single lex annotation which jCarafe issues. As described above, you can inherit category:token from the core task to access this annotation.

There is no corresponding special category for sentence annotations.

  <engines>
    ...
    <!-- the Carafe English tokenizer -->
    <engine name='carafe_tokenize_engine'>
      <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
    </engine>
    ...
  </engines>

By default, this step only captures the tokens. For an example of how sentence segmentation is captured and used, in the context of sentence classification, see the sample tasks.

CRF trainer/tagger

Your next engine might learn and apply span labels. If an engine is trainable, it has a <model_config> element, which specifies the Python class which implements the trainer, in addition to its <step_config> element. MAT provides the jCarafe conditional random field trainer/tagger to accomplish this.

  <engines>
    ...
    <!-- the jCarafe CRF trainer/tagger -->
    <engine name='carafe_tag_engine'>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>
    ...
  </engines>

When an engine is trainable, you can specify some other settings, e.g.:

    <engine name='carafe_tag_engine'>
      <default_model>default_model</default_model>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
        <build_settings training_method='psa' max_iterations='6'/>
      </model_config>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>

The <default_model> element instructs the trainer to save default models to the file "default_model" in your task directory. You can also customize your model builder using the <build_settings> element; for instance, here we use the faster, but possibly less-well-performing (and slightly less reliable) periodic stepsize adjustment training method.

  <model_config class="MAT.JavaCarafe.CarafeModelBuilder">
    <build_settings training_method="psa" max_iterations="6"/>
  </model_config>

There are lots of ways of customizing the jCarafe model builder. See MATModelBuilder and the jCarafe engine documentation for more details about these settings.

Maximum-entropy trainer/tagger

Another engine you might be interested in is one which learns and applies the values of specific attributes. These attributes can be string or choice attributes, or boolean attributes. The default jCarafe engine provides such a trainer/tagger. You can use such an engine for a task like sentence classification. For a detailed example, see the sample tasks.

  <engines>
    ...
    <!-- the jCarafe maxent trainer/tagger -->
    <engine name='classifier_engine'>
      <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
        <build_settings feature_extractors="_bagOfWords,_bigrams"/>
      </model_config>
      <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
    </engine>
    ...
  </engines>

The definition of this engine is slightly different, because it works differently than the conditional random field engine. See the jCarafe documentation for further discussion.
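To pull the four engine declarations together, here's a sketch which lists each engine's tagger class and, for trainable engines, its trainer class. It uses only Python's standard xml.etree.ElementTree module and is not part of MAT. The helper name summarize_engines is hypothetical, and treating a missing <step_config> as an error is our assumption, based on the engines shown in this section.

```python
import xml.etree.ElementTree as ET

# The four engines from this section, gathered into one <engines> element.
ENGINES_XML = """
<engines>
  <engine name='whole_zone_engine'>
    <step_config class='MAT.PluginMgr.WholeZoneStep'/>
  </engine>
  <engine name='carafe_tokenize_engine'>
    <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  </engine>
  <engine name='carafe_tag_engine'>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='classifier_engine'>
    <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
      <build_settings feature_extractors="_bagOfWords,_bigrams"/>
    </model_config>
    <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
  </engine>
</engines>
"""

def summarize_engines(xml_text):
    """Map engine name to its tagger class and trainer class (or None).

    Hypothetical helper, not a MAT API.
    """
    summary = {}
    for engine in ET.fromstring(xml_text).findall("engine"):
        step = engine.find("step_config")
        if step is None:
            raise ValueError("engine %r has no <step_config>" % engine.get("name"))
        model = engine.find("model_config")
        summary[engine.get("name")] = {
            "tagger": step.get("class"),
            "trainer": None if model is None else model.get("class"),
        }
    return summary

engines = summarize_engines(ENGINES_XML)
trainable = sorted(name for name, info in engines.items() if info["trainer"])
print(trainable)
```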

Step 6: Define your steps and workflows

Next, if you want, you'll declare your steps. You don't need to declare steps and workflows; if you're just defining your annotation sets in order to run MATScore, MATReport or MATTransducer, there's no need to specify them. But if you're going to do any annotation, you're going to want them.

Steps are global, and workflows use them to define the order of annotation. There are three kinds of steps: signal steps, annotation steps, and transform steps. You'll almost certainly be using only annotation steps in your task, and we'll describe only those here.

There are four types of annotation steps:

hand annotation steps, which have no engine and can only be applied in the UI
automated steps, which have an engine which must be used (i.e., you can't add or modify these annotations by hand)
correction steps, which have an engine whose output can be corrected if you're in the UI
mixed steps, which have an engine, in which you can either do hand annotation in the UI or automated annotation followed by correction

When you specify an annotation step, you must indicate which annotation sets are added or modified by that step. The MAT UI will make those annotation sets - and only those annotation sets - available to you when you hand-annotate. The set names are the values of the "name" attributes of the <annotation_set_descriptor> elements. You can also refer to entire categories, using the "category:" prefix.

  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='content'
                     type='mixed' name='carafe_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
    <annotation_step engine='carafe_tokenize_engine'
                     sets_added='category:token' type='auto'
                     name='carafe_tokenize'/>
    <annotation_step type='hand' name='correct'
                     sets_modified='content'/>
  </steps>

Here, we have four steps. The first three each have an engine. Two of them are "auto" steps, which means you can't modify their annotations by hand, and one of them, carafe_tag, is a mixed step, which is where you can do your mixed-initiative tagging: either annotating by hand entirely from scratch, or tagging automatically and then correcting by hand. The fourth step here is a hand annotation step; it has no engine.
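The relationship between steps and engines can be checked mechanically. Here's a sketch which confirms that every non-hand step names a declared engine, and that hand steps name none. It uses only Python's standard xml.etree.ElementTree module and is not part of MAT; the helper name step_problems is hypothetical, and the rules it enforces are just our reading of the step types listed above.

```python
import xml.etree.ElementTree as ET

# Engines and steps from this document, combined into one fragment.
TASK_FRAGMENT = """
<task name="Widget Annotation">
  <engines>
    <engine name='whole_zone_engine'>
      <step_config class='MAT.PluginMgr.WholeZoneStep'/>
    </engine>
    <engine name='carafe_tokenize_engine'>
      <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
    </engine>
    <engine name='carafe_tag_engine'>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>
  </engines>
  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='content'
                     type='mixed' name='carafe_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
    <annotation_step engine='carafe_tokenize_engine'
                     sets_added='category:token' type='auto'
                     name='carafe_tokenize'/>
    <annotation_step type='hand' name='correct'
                     sets_modified='content'/>
  </steps>
</task>
"""

def step_problems(xml_text):
    """Return a list of human-readable problems; empty if none were found.

    Hypothetical helper, not a MAT API.
    """
    root = ET.fromstring(xml_text)
    engines = {e.get("name") for e in root.findall("engines/engine")}
    problems = []
    for step in root.findall("steps/annotation_step"):
        name, kind, engine = step.get("name"), step.get("type"), step.get("engine")
        if kind == "hand":
            if engine is not None:
                problems.append(f"{name}: hand steps have no engine")
        elif engine not in engines:
            problems.append(f"{name}: unknown or missing engine {engine!r}")
    return problems

print(step_problems(TASK_FRAGMENT))
```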

Once you have your steps defined, you can organize them into workflows:

  <workflows>
    <workflow name='Tokenless hand annotation'>
      <step pretty_name='zone' name='whole_zone'/>
      <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
    </workflow>
    <workflow name='Review/repair'>
      <step name='correct'/>
    </workflow>
    <workflow name='Demo' undoable="yes">
      <step pretty_name='zone' name='whole_zone'/>
      <step pretty_name='tokenize' name='carafe_tokenize'/>
      <step pretty_name='tag' name='carafe_tag'/>
    </workflow>
  </workflows>

Within workflows, you can assign alternate names to your steps, if you want to refer to them differently depending on what workflow they're in. You can also further limit the step type; so in the "Tokenless hand annotation" workflow here, we're using the carafe_tag step, but we're limiting it to hand annotation only.
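To see what the renaming and type-limiting amount to, here's a sketch which resolves each workflow step against the global step list and reports the name and step type which apply within that workflow. It uses only Python's standard xml.etree.ElementTree module and is not part of MAT. The helper name summarize_workflows is hypothetical, and falling back to the global name and type when a workflow step doesn't override them is our assumption.

```python
import xml.etree.ElementTree as ET

# Steps and workflows from this document, combined into one fragment.
TASK_FRAGMENT = """
<task name="Widget Annotation">
  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='content'
                     type='mixed' name='carafe_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
    <annotation_step engine='carafe_tokenize_engine'
                     sets_added='category:token' type='auto'
                     name='carafe_tokenize'/>
    <annotation_step type='hand' name='correct'
                     sets_modified='content'/>
  </steps>
  <workflows>
    <workflow name='Tokenless hand annotation'>
      <step pretty_name='zone' name='whole_zone'/>
      <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
    </workflow>
    <workflow name='Review/repair'>
      <step name='correct'/>
    </workflow>
    <workflow name='Demo' undoable="yes">
      <step pretty_name='zone' name='whole_zone'/>
      <step pretty_name='tokenize' name='carafe_tokenize'/>
      <step pretty_name='tag' name='carafe_tag'/>
    </workflow>
  </workflows>
</task>
"""

def summarize_workflows(xml_text):
    """Map workflow name to a list of (display name, effective type) pairs.

    Hypothetical helper, not a MAT API. Raises ValueError for a workflow
    step which doesn't match any declared step.
    """
    root = ET.fromstring(xml_text)
    step_types = {s.get("name"): s.get("type")
                  for s in root.findall("steps/annotation_step")}
    summary = {}
    for workflow in root.findall("workflows/workflow"):
        rows = []
        for step in workflow.findall("step"):
            name = step.get("name")
            if name not in step_types:
                raise ValueError(f"{workflow.get('name')}: unknown step {name!r}")
            rows.append((step.get("pretty_name", name),
                         step.get("type", step_types[name])))
        summary[workflow.get("name")] = rows
    return summary

summary = summarize_workflows(TASK_FRAGMENT)
for workflow_name, rows in summary.items():
    print(workflow_name, rows)
```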

Finally, you can designate your workflow as undoable. By default, you can't undo steps in MAT; this is because there's no global order to steps, and if you undo steps in a given workflow, you may leave your document in an odd state. For instance, if you apply the "Demo" workflow here to a document, and then later load it using the "Tokenless hand annotation" workflow and undo that workflow, the "tokenize" step will still be done. We recommend that if you want to make one of your workflows undoable, you should select a workflow in your task which applies all of the annotation sets in your task.

You can find additional details about the steps and workflows here.

Step 7: Define your workspace configurations

In MAT, any workflow which contains at least one hand-annotatable annotation step can be used as the basis of a workspace. So in order to use workspaces, you don't really need to specify anything in your task.xml. However, there are things you might want to say.

For instance, if you have multiple workflows which can form a workspace, you may want to declare a default:

  <workspaces default_config="Demo"/>

This default will be used by MATWorkspaceEngine when it creates a workspace for your task, if you don't specify a workspace configuration.

Note that in the context of workspaces, workflows are referred to as workspace configurations. This is because you can define named custom configurations of workflows to use for your workspaces. Within these configurations, you can specify parameters to the operations the workspace calls. Only some of the operations accept custom parameters; see the workspace reference for details.

  <workspaces default_config="Demo">
    <workspace config_name="Custom Demo" workflow='Demo'>
      <operation name="modelbuild">
        <settings training_method="psa" max_iterations="10"/>
      </operation>
    </workspace>
  </workspaces>
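Since configuration names and workflow names are easy to mistype, here's a sketch which checks that each custom configuration points at a declared workflow, and that the default names either a workflow or a custom configuration. It uses only Python's standard xml.etree.ElementTree module and is not part of MAT. The helper name workspace_problems is hypothetical, and the workflows in the fragment are reduced to their names.

```python
import xml.etree.ElementTree as ET

# Workflow names plus the <workspaces> element from above.
TASK_FRAGMENT = """
<task name="Widget Annotation">
  <workflows>
    <workflow name='Tokenless hand annotation'/>
    <workflow name='Review/repair'/>
    <workflow name='Demo'/>
  </workflows>
  <workspaces default_config="Demo">
    <workspace config_name="Custom Demo" workflow='Demo'>
      <operation name="modelbuild">
        <settings training_method="psa" max_iterations="10"/>
      </operation>
    </workspace>
  </workspaces>
</task>
"""

def workspace_problems(xml_text):
    """Return a list of human-readable problems; empty if none were found.

    Hypothetical helper, not a MAT API.
    """
    root = ET.fromstring(xml_text)
    workflows = {w.get("name") for w in root.findall("workflows/workflow")}
    workspaces = root.find("workspaces")
    if workspaces is None:
        return []  # nothing declared, so nothing to check
    problems = []
    configs = set(workflows)  # every workflow doubles as a configuration
    for workspace in workspaces.findall("workspace"):
        config, workflow = workspace.get("config_name"), workspace.get("workflow")
        if workflow not in workflows:
            problems.append(f"{config}: unknown workflow {workflow!r}")
        configs.add(config)
    default = workspaces.get("default_config")
    if default is not None and default not in configs:
        problems.append(
            f"default_config {default!r} is not a workflow or custom configuration")
    return problems

print(workspace_problems(TASK_FRAGMENT))
```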

Step 8: Install the task

Use the MATManagePluginDirs tool to ensure that MAT knows about your task directory. If <dir> is your task directory:

Unix:

% $MAT_PKG_HOME/bin/MATManagePluginDirs install <dir>

Windows native:

> %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd install <dir>

If you want to validate the task before you install it (if, e.g., your installation is shared and you don't want to mess up other people who are currently using it), use the "validate" action:

Unix:

% $MAT_PKG_HOME/bin/MATManagePluginDirs validate <dir>

Windows native:

> %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd validate <dir>
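If you'd rather drive these commands from a script, here's a small sketch which builds the argument list for either platform. It only assembles the command lines shown above; it doesn't run anything. The helper name manage_plugin_dirs_argv and its parameters are hypothetical, and it covers only the two actions used in this document.

```python
def manage_plugin_dirs_argv(action, task_dir, mat_pkg_home, windows=False):
    """Build the MATManagePluginDirs command line as a list of arguments.

    Hypothetical helper, not a MAT API. mat_pkg_home is the value of your
    MAT_PKG_HOME setting.
    """
    if action not in ("install", "validate"):
        raise ValueError("this sketch only knows 'install' and 'validate'")
    script = "MATManagePluginDirs.cmd" if windows else "MATManagePluginDirs"
    separator = "\\" if windows else "/"
    # Drop any trailing separator so we don't produce a doubled one.
    home = mat_pkg_home.rstrip("/\\")
    return [separator.join([home, "bin", script]), action, task_dir]

print(manage_plugin_dirs_argv("validate", "/tmp/widget_task", "/opt/mat"))
```

You could pass the resulting list to subprocess.run from the Python standard library, validating first and installing afterward.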