If you haven't received an already-customized version of MAT, and
you want to do something besides the default named entity task,
you're going to want to define your own task. This document
describes how to do that for simple tasks.
Create a directory. This directory might ultimately have various
subdirectories; for instance, custom Python code must live in
files in a python/ subdirectory, and custom Javascript code should
live in files in a js/ subdirectory. But you don't need to know
about those right now.
Most of this document is about the task.xml file. For a fuller
picture of what this file contains, see the task documentation,
the documentation on the sample tasks, and the documentation on
the task XML and annotation set descriptor XML itself.
For now, just create an empty file named task.xml and save it in
the directory you just created.
Create a top-level <task> element, and give your task a
name. The <languages> element is obligatory; each task needs to
declare which languages it applies to. The entry for English
provided here can be reused in your various tasks:
<task name="Widget Annotation">
<languages>
<language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
</languages>
<annotations>
</annotations>
</task>
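As a quick sanity check on a skeleton like this one, here's a minimal sketch that uses Python's standard xml.etree.ElementTree module to confirm that the file has a named <task> root and at least one <language> entry. The helper name check_skeleton is hypothetical; it isn't part of MAT, and it doesn't replace the "validate" action of MATManagePluginDirs described later in this document.

```python
import xml.etree.ElementTree as ET

# The skeleton from the text, indented for readability.
SKELETON = """
<task name="Widget Annotation">
  <languages>
    <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
  </languages>
  <annotations>
  </annotations>
</task>
"""

def check_skeleton(xml_text):
    """Return a list of problems found in a task.xml skeleton (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "task":
        problems.append("top-level element is <%s>, not <task>" % root.tag)
    if not root.get("name"):
        problems.append("<task> has no name attribute")
    if not root.findall("./languages/language"):
        problems.append("no <language> declared inside <languages>")
    return problems

if __name__ == "__main__":
    print(check_skeleton(SKELETON))
```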
Let's assume that you're going to use the default MAT automated
tools (tagger, trainer, tokenizer). First, you'll want to
inherit the zone
and token annotations from the core task. So far, your
annotations section looks like this:
...
<annotations inherit="category:zone,category:token">
</annotations>
...
These two categories are defined by MAT, for your use.
Now, let's populate your content annotations.
Let's start by defining a simple task, in fact, the MUC named
entity task. This task has three spanned annotations: PERSON,
LOCATION and ORGANIZATION. Let's define them:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON'/>
<span label='LOCATION'/>
<span label='ORGANIZATION'/>
</annotations>
...
Your task now "knows" about these labels. So for example, if you
refer to "set:content" or "category:content" when using MATScore with this task, MATScore will
know you're referring to these three labels (because they're in
the default set and category).
But if you want to see them in the MAT UI, or add them by hand,
you'll have to do something else: you'll have to give them display
properties. These properties are attributes of the <span>
element, and they all start with "d_":
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L'/>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
...
Now these annotations will be styled with the colors shown, and
when you swipe in the MAT UI to add an annotation, you'll be able
to use the keyboard accelerators instead of the mouse (e.g., when
the popup menu is visible, pressing 'P' will be the same as
selecting the "PERSON" item).
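Since each accelerator is meant to pick out one item in the popup menu, you may want to check that no key is assigned to two labels. The sketch below does that with Python's standard library. The helper name repeated_accelerators is hypothetical, and how MAT itself treats a repeated accelerator isn't covered here; the check simply assumes you'd want each key to be unique.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The annotations section from the text, indented for readability.
ANNOTATIONS = """
<annotations inherit="category:zone,category:token">
  <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
  <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L'/>
  <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
"""

def repeated_accelerators(xml_text):
    """Return the sorted accelerator keys assigned to more than one label."""
    root = ET.fromstring(xml_text)
    # Only the direct children of <annotations> declare labels.
    keys = [e.get("d_accelerator") for e in root if e.get("d_accelerator")]
    return sorted(k for k, n in Counter(keys).items() if n > 1)

if __name__ == "__main__":
    print(repeated_accelerators(ANNOTATIONS))
```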
Let's say that you want to add some attributes to the LOCATION
annotation: a "nomtype" choice attribute which indicates whether
the element is a proper name, a nominal, or a pronoun; a boolean
"is_gpe" to indicate whether the location is a geopolitical
entity, and a "comment" field to allow the annotator to add
comments. Let's do those now:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L'>
<string name='nomtype' choices='proper name,nominal,pronoun'/>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
...
Attributes are defined in the scope of their annotation label, so
we had to "open up" the <span> element that declares the
LOCATION annotation. Notice that the type of each attribute is
encoded in the element name. We've given the "nomtype" attribute
three fixed choices, which are comma-separated; you can also use
spaces as separators if there are no commas in your choices.
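To make the separator convention concrete, here's a small sketch of one way to read it: if the value contains a comma, split on commas; otherwise, split on whitespace. The function name split_choices is hypothetical, and trimming the whitespace around each comma-separated choice is our assumption; MAT's own parsing may differ in such details.

```python
def split_choices(value):
    """Split a "choices"-style attribute value into a list of choices.

    Assumed convention, following the text: commas separate the choices
    whenever a comma is present; otherwise whitespace does.
    """
    if "," in value:
        # Assumption: whitespace around each comma-separated choice is dropped.
        return [c.strip() for c in value.split(",") if c.strip()]
    return value.split()

if __name__ == "__main__":
    print(split_choices("proper name,nominal,pronoun"))
    print(split_choices("proper nominal pronoun"))
```

Note that a choice like "proper name" only survives intact in the comma-separated form, which is why the example in the text uses commas.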
Now, in the MAT UI, if you click on an existing LOCATION
annotation, you'll be presented with an option to edit that annotation in an
annotation editor.
You can also add attributes with integer or float values (and
optionally assign range limitations to them). For complete
details, see here.
Let's refine the annotation display further. Let's say that you
want to bring up the annotation editor immediately when you create
the LOCATION annotation; we'll add the d_edit_immediately='yes'
setting to make that happen. And let's say that you want keyboard
accelerators for the values of the "nomtype" attribute, for those
places in the UI where menus for this attribute are presented.
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
<string name='nomtype'>
<choice accelerator='P' value='proper name'/>
<choice accelerator='N' value='nominal'/>
<choice accelerator='O' value='pronoun'/>
</string>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
...
Notice that in order to do this, we had to make the declaration
of the "nomtype" choices a bit more verbose. Instead of using an
attribute on <string>, we've "opened up" the <string>
element and inserted <choice> element children to host the
value and accelerator. This more verbose alternative is also
useful when your choices contain both commas and spaces, and
can't be separated reliably using the conventions for the
"choices" XML attribute.
Now let's add another attribute: an attribute of PERSON named
"roles", which can be a set of roles drawn from a set of possible
choices. This shows that in addition to single values, you can
have attributes whose values are sets (in this case, sets of
strings) or lists. And let's give this attribute a default value:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
<string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
</span>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
<string name='nomtype'>
<choice accelerator='P' value='proper name'/>
<choice accelerator='N' value='nominal'/>
<choice accelerator='O' value='pronoun'/>
</string>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
...
Set-valued attributes will be rendered as a multiple-choice menu
in the MAT UI. Notice that the different attribute aggregation
possibilities (sets and lists) are encoded, like the attribute
types, as part of the element name.
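To see how the element names carry this information, here's a sketch that walks an annotations section and reports, for each label, each attribute's name, base type, aggregation and choices. The helper name describe_attributes is hypothetical. The text above only shows string_set; treating "_set" and "_list" as suffixes on the type name is our assumption, so check the annotation set descriptor documentation before relying on it.

```python
import xml.etree.ElementTree as ET

# The annotations section built so far, indented for readability.
ANNOTATIONS = """
<annotations inherit="category:zone,category:token">
  <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
    <string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
  </span>
  <span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
    <string name='nomtype'>
      <choice accelerator='P' value='proper name'/>
      <choice accelerator='N' value='nominal'/>
      <choice accelerator='O' value='pronoun'/>
    </string>
    <boolean name='is_gpe'/>
    <string name='comment'/>
  </span>
  <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
"""

def describe_attributes(xml_text):
    """Return {label: [(attribute name, base type, aggregation, choices)]}."""
    summary = {}
    for label in ET.fromstring(xml_text):
        rows = []
        for attr in label:
            base, aggregation = attr.tag, None
            # Assumption: aggregation is a "_set" or "_list" suffix on the tag.
            for suffix in ("_set", "_list"):
                if attr.tag.endswith(suffix):
                    base, aggregation = attr.tag[:-len(suffix)], suffix[1:]
            if attr.get("choices"):
                # Simplification: only the comma-separated form is handled here.
                choices = [c.strip() for c in attr.get("choices").split(",")]
            else:
                choices = [c.get("value") for c in attr.findall("choice")]
            rows.append((attr.get("name"), base, aggregation, choices))
        summary[label.get("label")] = rows
    return summary

if __name__ == "__main__":
    for label, rows in describe_attributes(ANNOTATIONS).items():
        print(label, rows)
```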
Let's complete our set of span annotations with a special label
for hand annotation: WUZZA. Let's say we want to use this
annotation when we suspect, as human annotators, that the span
ought to be annotated, but we're not sure how. These annotations
aren't intended to be processed, so we want the automated steps
(e.g., the training engine and the scorer) to ignore them. So we
mark the label with processable='no', and make sure to
give it a style:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
<string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
</span>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
<string name='nomtype'>
<choice accelerator='P' value='proper name'/>
<choice accelerator='N' value='nominal'/>
<choice accelerator='O' value='pronoun'/>
</string>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
<span label='WUZZA' processable='no' d_css='background-color: gray'/>
</annotations>
...
(We could also "protect" this label by setting up multiple
annotation set descriptors, but as we said, we're ignoring that
option in this example.)
Now that we've got a good handle on the basic annotation options,
let's move on to annotations that point to other annotations, and
spanless annotations.
Let's say that now we want to add a label that we intend to use
with verbs like "work for" or "employ". Let's call it EMPLOYS. And
let's set it up so that you can link it to an employer (an
ORGANIZATION) and an employee (a PERSON). And just to amuse
ourselves, when we style it, we'll change its text foreground
color, so that it's white on a dark background:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
<string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
</span>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
<string name='nomtype'>
<choice accelerator='P' value='proper name'/>
<choice accelerator='N' value='nominal'/>
<choice accelerator='O' value='pronoun'/>
</string>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
<span label='WUZZA' processable='no' d_css='background-color: gray'/>
<span label='EMPLOYS' d_css='color: white; background-color: purple'>
<filler name='employer' filler_types='ORGANIZATION'/>
<filler name='employee' filler_types='PERSON'/>
</span>
</annotations>
...
When you add this annotation in the MAT UI, it'll look like any
other span annotation; but you'll also have a number of options
for linking
it to other annotations via its annotation-typed attributes
(introduced by the <filler> element here).
You can have multiple possibilities for label fillers. So, for
instance, you could specify that employers can be either
organizations or people:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
<string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
</span>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
<string name='nomtype'>
<choice accelerator='P' value='proper name'/>
<choice accelerator='N' value='nominal'/>
<choice accelerator='O' value='pronoun'/>
</string>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
<span label='WUZZA' processable='no' d_css='background-color: gray'/>
<span label='EMPLOYS' d_css='color: white; background-color: purple'>
<filler name='employer' filler_types='ORGANIZATION PERSON'/>
<filler name='employee' filler_types='PERSON'/>
</span>
</annotations>
...
Like the choices, the filler types can be either comma- or
space-separated.
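Since each filler type names another label, a typo in filler_types is easy to make. Here's a sketch that lists filler types which don't match any label declared in the same annotations section. The helper names are hypothetical and not part of MAT. Labels that come from elsewhere (e.g., via the inherit attribute) aren't visible to this check, so it may report them as missing.

```python
import xml.etree.ElementTree as ET

# A reduced annotations section based on the text, indented for readability.
ANNOTATIONS = """
<annotations inherit="category:zone,category:token">
  <span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'/>
  <span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
  <span label='EMPLOYS' d_css='color: white; background-color: purple'>
    <filler name='employer' filler_types='ORGANIZATION PERSON'/>
    <filler name='employee' filler_types='PERSON'/>
  </span>
</annotations>
"""

def split_list(value):
    """Split on commas if any are present, otherwise on whitespace."""
    if "," in value:
        return [v.strip() for v in value.split(",") if v.strip()]
    return value.split()

def undeclared_filler_types(xml_text):
    """Return (label, filler name, filler type) for each type with no local declaration."""
    root = ET.fromstring(xml_text)
    declared = {e.get("label") for e in root if e.tag in ("span", "spanless")}
    missing = []
    for label in root:
        for filler in label.findall("filler"):
            for filler_type in split_list(filler.get("filler_types", "")):
                if filler_type not in declared:
                    missing.append((label.get("label"), filler.get("name"), filler_type))
    return missing

if __name__ == "__main__":
    print(undeclared_filler_types(ANNOTATIONS))
```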
Finally, annotations don't need to be anchored to a particular
span. In the MAT UI, we create, edit and view these annotations in
the spanless sidebar. These
annotations are defined and styled just like any other annotation,
except they're declared with the <spanless> element rather
than the <span> element:
...
<annotations inherit="category:zone,category:token">
<span label='PERSON' d_css='background-color: #CCFF66' d_accelerator='P'>
<string_set name='roles' default='sales' choices='supervisor,technical contributor,sales'/>
</span>
<span label='LOCATION' d_css='background-color: #FF99CC' d_accelerator='L' d_edit_immediately='yes'>
<string name='nomtype'>
<choice accelerator='P' value='proper name'/>
<choice accelerator='N' value='nominal'/>
<choice accelerator='O' value='pronoun'/>
</string>
<boolean name='is_gpe'/>
<string name='comment'/>
</span>
<span label='ORGANIZATION' d_css='background-color: #99CCFF' d_accelerator='O'/>
<span label='WUZZA' processable='no' d_css='background-color: gray'/>
<span label='EMPLOYS' d_css='color: white; background-color: purple'>
<filler name='employer' filler_types='ORGANIZATION PERSON'/>
<filler name='employee' filler_types='PERSON'/>
</span>
<spanless label='PARENT_OF' d_css='background-color: pink'>
<filler name='parent' filler_types='PERSON'/>
<filler name='child' filler_types='PERSON'/>
</spanless>
</annotations>
...
(Well, not exactly like other annotations; spanless
annotations have edit_immediately='yes' set by default, and it
doesn't make any sense to style their foreground colors, since
they don't span any text.)
There's lots more you can do to define annotations and their
displays; see the annotation set
descriptor XML use cases and the task XML use cases for
examples.
Now, if you're going to do any automated processing, it's time to
define your engines. Details of the engines discussed below can be
found here.
An engine is an automated tagger, which may or may not be trainable. The core task doesn't define any engines that you can inherit. In this section, we describe the engines you might want. You'll see that each of these engines has a <step_config>, which specifies the Python class that implements the tagger.
Your first engine might declare what sections of the document are
annotatable. MAT has a special category of annotations called zone, which controls this. As
described above, you can inherit category:zone from the
core task to access the default zone annotation. MAT provides a
simple engine which marks the entire document annotatable. This
engine is mostly of historical interest: from MAT 3.0 forward,
unzoned documents are treated as if they've been processed with
this engine.
<engines>
...
<!-- simple zoner which marks the entire document annotatable -->
<engine name='whole_zone_engine'>
<step_config class='MAT.PluginMgr.WholeZoneStep'/>
</engine>
...
</engines>
This engine is not trainable.
Your next engine might tokenize the annotatable regions, and/or
segment these regions into sentences. The default jCarafe engine includes a
(non-trainable) tokenizer which performs reasonable tokenization, and
simple sentence segmentation, for English. MAT provides a special
annotation category called token which contains the single lex
annotation which jCarafe issues. As described above, you can
inherit category:token from the core task to access this
annotation.
There is no corresponding special category for sentence
annotations.
<engines>
...
<!-- the Carafe English tokenizer -->
<engine name='carafe_tokenize_engine'>
<step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
</engine>
...
</engines>
By default, this step only captures the tokens. For an example of
how sentence segmentation is captured and used, in the context of
sentence classification, see the sample
tasks.
Your next engine might learn and apply span labels. If an engine
is trainable, it has a <model_config> element, which specifies the
Python class which implements the trainer, in addition to its
<step_config> element. MAT provides the jCarafe conditional random field
trainer/tagger to accomplish this.
<engines>
...
<!-- the jCarafe CRF trainer/tagger -->
<engine name='carafe_tag_engine'>
<model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
<step_config class='MAT.JavaCarafe.CarafeTagStep'/>
</engine>
...
</engines>
When an engine is trainable, you can specify some other settings, e.g.:
<engine name='carafe_tag_engine'>
<default_model>default_model</default_model>
<model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
<build_settings training_method='psa' max_iterations='6'/>
</model_config>
<step_config class='MAT.JavaCarafe.CarafeTagStep'/>
</engine>
The <default_model> element instructs the trainer to save
default models to the file "default_model" in your task directory.
You can also customize your model builder using the
<build_settings> element; for instance, here we use the
faster, but possibly less-well-performing (and slightly less
reliable) periodic stepsize adjustment training method.
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
There are lots of ways of customizing the jCarafe model builder.
See MATModelBuilder and the jCarafe engine documentation for
more details about these settings.
Another engine you might be interested in is one which learns and
applies the values of specific attributes. These attributes can be
string or choice attributes, or boolean attributes. The default jCarafe engine provides such a
trainer/tagger. You can use such an engine for a task like
sentence classification. For a detailed example, see the sample tasks.
<engines>
...
<!-- the jCarafe maxent trainer/tagger -->
<engine name='classifier_engine'>
<model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
<build_settings feature_extractors="_bagOfWords,_bigrams"/>
</model_config>
<step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
</engine>
...
</engines>
The definition of this engine is slightly different, because it
works differently than the conditional random field engine. See
the jCarafe documentation for further discussion.
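To summarize the pattern across these engine declarations, here's a sketch that lists each engine with its step class, its model builder class (if any), and whether it's trainable in the sense used above, i.e., whether it has a <model_config> element. The helper name describe_engines is hypothetical and not part of MAT.

```python
import xml.etree.ElementTree as ET

# Three of the engines from the text, indented for readability.
ENGINES = """
<engines>
  <engine name='whole_zone_engine'>
    <step_config class='MAT.PluginMgr.WholeZoneStep'/>
  </engine>
  <engine name='carafe_tag_engine'>
    <default_model>default_model</default_model>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
      <build_settings training_method='psa' max_iterations='6'/>
    </model_config>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='classifier_engine'>
    <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
      <build_settings feature_extractors="_bagOfWords,_bigrams"/>
    </model_config>
    <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
  </engine>
</engines>
"""

def describe_engines(xml_text):
    """Return {engine name: {"step_class", "model_class", "trainable"}}."""
    result = {}
    for engine in ET.fromstring(xml_text).findall("engine"):
        step = engine.find("step_config")
        model = engine.find("model_config")
        result[engine.get("name")] = {
            "step_class": step.get("class") if step is not None else None,
            "model_class": model.get("class") if model is not None else None,
            # Trainable here just means a <model_config> is present.
            "trainable": model is not None,
        }
    return result

if __name__ == "__main__":
    for name, info in describe_engines(ENGINES).items():
        print(name, info)
```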
Next, if you want, you'll declare your steps. You don't need to
declare steps and workflows; if you're just defining your
annotation sets in order to run MATScore,
MATReport or MATTransducer, there's no need to
specify them. But if you're going to do any annotation, you're
going to want them.
Steps are global, and workflows use them to define the order of
annotation. There are three kinds of steps: signal steps,
annotation steps, and transform steps. You'll almost certainly be
using only annotation steps in your task, and we'll describe only
those here.
There are four types of annotation steps:
When you specify an annotation step, you must indicate which
annotation sets are added or modified by that step. The MAT UI
will make those annotation sets - and only those annotation sets -
available to you when you hand-annotate. The set names are the
values of the "name" attributes of the
<annotation_set_descriptor> elements. You can also refer to
entire categories, using the "category:" prefix.
<steps>
<annotation_step engine='carafe_tag_engine' sets_added='content'
type='mixed' name='carafe_tag'/>
<annotation_step engine='whole_zone_engine' sets_added='category:zone'
type='auto' name='whole_zone'/>
<annotation_step engine='carafe_tokenize_engine'
sets_added='category:token' type='auto'
name='carafe_tokenize'/>
<annotation_step type='hand' name='correct'
sets_modified='content'/>
</steps>
Here, we have four steps. The first three each have an engine.
Two of them are "auto" steps, which means you can't modify them
via hand annotation. One of them, carafe_tag, is a "mixed" step,
which is where you can do your mixed-initiative tagging:
annotating automatically, annotating by hand entirely from
scratch, or pretagging and correcting by hand. The fourth step
here is a hand annotation step; it has no engine.
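Because steps refer to engines by name, a misspelled engine name is an easy mistake to make. Here's a sketch that reports any step whose engine attribute doesn't match a declared engine. The helper name check_step_engines is hypothetical and not part of MAT. The sample places <engines> and <steps> directly under <task>, which is our assumption about the layout; the code searches the whole tree, so it doesn't depend on that.

```python
import xml.etree.ElementTree as ET

# Engines and steps from the text, combined into one task for illustration.
TASK = """
<task name="Widget Annotation">
  <engines>
    <engine name='whole_zone_engine'>
      <step_config class='MAT.PluginMgr.WholeZoneStep'/>
    </engine>
    <engine name='carafe_tokenize_engine'>
      <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
    </engine>
    <engine name='carafe_tag_engine'>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>
  </engines>
  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='content'
                     type='mixed' name='carafe_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
    <annotation_step engine='carafe_tokenize_engine'
                     sets_added='category:token' type='auto'
                     name='carafe_tokenize'/>
    <annotation_step type='hand' name='correct'
                     sets_modified='content'/>
  </steps>
</task>
"""

def check_step_engines(xml_text):
    """Return (step name, engine name) for each step naming an undeclared engine."""
    root = ET.fromstring(xml_text)
    engines = {e.get("name") for e in root.iter("engine")}
    problems = []
    for step in root.iter("annotation_step"):
        engine = step.get("engine")
        # Steps without an engine attribute (e.g. the hand step) are skipped.
        if engine is not None and engine not in engines:
            problems.append((step.get("name"), engine))
    return problems

if __name__ == "__main__":
    print(check_step_engines(TASK))
```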
Once you have your steps defined, you can organize them into
workflows:
<workflows>
<workflow name='Tokenless hand annotation'>
<step pretty_name='zone' name='whole_zone'/>
<step name='carafe_tag' pretty_name='hand tag' type='hand'/>
</workflow>
<workflow name='Review/repair'>
<step name='correct'/>
</workflow>
<workflow name='Demo' undoable="yes">
<step pretty_name='zone' name='whole_zone'/>
<step pretty_name='tokenize' name='carafe_tokenize'/>
<step pretty_name='tag' name='carafe_tag'/>
</workflow>
</workflows>
Within workflows, you can assign alternate names to your steps,
if you want to refer to them differently depending on what
workflow they're in. You can also further limit the step type; so
in the "Tokenless hand annotation" workflow here, we're using the
carafe_tag step, but we're limiting it to hand annotation only.
Finally, you can designate your workflow as undoable. By default,
you can't undo steps in MAT; this is because there's no global
order to steps, and if you undo steps in a given workflow, you may
leave your document in an odd state. For instance, if you apply
the "Demo" workflow here to a document, and then later load it
using the "Tokenless hand annotation" workflow and undo that
workflow, the "tokenize" step will still be done. We recommend
that if you want to make one of your workflows undoable, you
should select a workflow in your task which applies all of the
annotation sets in your task.
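Here's a sketch of how you might check that recommendation mechanically. It reports workflow steps that don't match a declared step, and undoable workflows that don't apply every annotation set mentioned by some step in the task. The helper names are hypothetical and not part of MAT, and treating sets_added and sets_modified as comma-separated lists is our assumption, since the examples here only show single values.

```python
import xml.etree.ElementTree as ET

# Steps and workflows from the text, combined into one task for illustration.
TASK = """
<task name="Widget Annotation">
  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='content'
                     type='mixed' name='carafe_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
    <annotation_step engine='carafe_tokenize_engine'
                     sets_added='category:token' type='auto'
                     name='carafe_tokenize'/>
    <annotation_step type='hand' name='correct'
                     sets_modified='content'/>
  </steps>
  <workflows>
    <workflow name='Tokenless hand annotation'>
      <step pretty_name='zone' name='whole_zone'/>
      <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
    </workflow>
    <workflow name='Review/repair'>
      <step name='correct'/>
    </workflow>
    <workflow name='Demo' undoable="yes">
      <step pretty_name='zone' name='whole_zone'/>
      <step pretty_name='tokenize' name='carafe_tokenize'/>
      <step pretty_name='tag' name='carafe_tag'/>
    </workflow>
  </workflows>
</task>
"""

def split_sets(value):
    # Assumption: multiple set names, if any, are comma-separated.
    return {s.strip() for s in value.split(",") if s.strip()}

def check_workflows(xml_text):
    """Return a list of problem descriptions for the workflows in a task."""
    root = ET.fromstring(xml_text)
    step_sets = {}
    for step in root.iter("annotation_step"):
        step_sets[step.get("name")] = (split_sets(step.get("sets_added", ""))
                                       | split_sets(step.get("sets_modified", "")))
    all_sets = set().union(*step_sets.values())
    problems = []
    for wf in root.iter("workflow"):
        covered = set()
        for ref in wf.findall("step"):
            if ref.get("name") in step_sets:
                covered |= step_sets[ref.get("name")]
            else:
                problems.append("%s: unknown step %s" % (wf.get("name"), ref.get("name")))
        missing = sorted(all_sets - covered)
        if wf.get("undoable") == "yes" and missing:
            problems.append("%s: undoable but never applies %s"
                            % (wf.get("name"), ", ".join(missing)))
    return problems

if __name__ == "__main__":
    print(check_workflows(TASK))
```

In the example task, only "Demo" is undoable, and it applies the zone, token and content sets, so nothing is reported.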
You can find additional details about the steps and workflows here.
In MAT, any workflow which contains at least one hand-annotatable
annotation step can be used as the basis of a workspace. So in order to use workspaces, you don't really need to
specify anything in your task.xml. However, there are things you
might want to say.
For instance, if you have multiple workflows which can form a
workspace, you may want to declare a default:
<workspaces default_config="Demo"/>
This default will be used by MATWorkspaceEngine
when it creates a workspace for your task, if you don't specify a
workspace configuration.
Note that in the context of workspaces, workflows are referred to
as workspace configurations. This is because you can
define named custom configurations of workflows to use for your
workspaces. Within these configurations, you can specify
parameters to the operations the workspace calls. Only some of the
operations accept custom parameters; see the workspace reference for details.
<workspaces default_config="Demo">
<workspace config_name="Custom Demo" workflow='Demo'>
<operation name="modelbuild">
<settings training_method="psa" max_iterations="10"/>
</operation>
</workspace>
</workspaces>
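Since workspace configurations refer to workflows by name, here's a sketch that checks those references. It assumes that default_config may name either a workflow or a custom configuration's config_name, which is our reading of the text above rather than something the workspace reference is quoted as saying. The helper name check_workspaces is hypothetical and not part of MAT.

```python
import xml.etree.ElementTree as ET

# One workflow and the workspace declaration from the text, for illustration.
TASK = """
<task name="Widget Annotation">
  <workflows>
    <workflow name='Demo' undoable="yes">
      <step pretty_name='zone' name='whole_zone'/>
      <step pretty_name='tokenize' name='carafe_tokenize'/>
      <step pretty_name='tag' name='carafe_tag'/>
    </workflow>
  </workflows>
  <workspaces default_config="Demo">
    <workspace config_name="Custom Demo" workflow='Demo'>
      <operation name="modelbuild">
        <settings training_method="psa" max_iterations="10"/>
      </operation>
    </workspace>
  </workspaces>
</task>
"""

def check_workspaces(xml_text):
    """Return a list of problem descriptions for the workspace declarations."""
    root = ET.fromstring(xml_text)
    workflows = {w.get("name") for w in root.iter("workflow")}
    problems = []
    for workspaces in root.iter("workspaces"):
        # Assumption: every workflow name is itself a usable configuration name.
        configs = set(workflows)
        for ws in workspaces.findall("workspace"):
            configs.add(ws.get("config_name"))
            if ws.get("workflow") not in workflows:
                problems.append("%s: unknown workflow %s"
                                % (ws.get("config_name"), ws.get("workflow")))
        default = workspaces.get("default_config")
        if default is not None and default not in configs:
            problems.append("default_config %s matches no configuration" % default)
    return problems

if __name__ == "__main__":
    print(check_workspaces(TASK))
```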
Use the MATManagePluginDirs tool to ensure that MAT knows about
your task directory. If <dir> is your task directory:
Unix:
% $MAT_PKG_HOME/bin/MATManagePluginDirs install <dir>
Windows native:
> %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd install <dir>
If you want to validate the task before you install it (if, e.g.,
your installation is shared and you don't want to mess up other
people who are currently using it), use the "validate" action:
Unix:
% $MAT_PKG_HOME/bin/MATManagePluginDirs validate <dir>
Windows native:
> %MAT_PKG_HOME%\bin\MATManagePluginDirs.cmd validate <dir>
Tasks are highly customizable, in ways that we'll never have
enough time to document. See the advanced documentation
for what we've been able to write down about these other
customizations, or work your way through the source code.