For the most part, you can't do anything substantial with the MAT
toolkit without defining a task.
We introduced tasks here.
The fundamental components of a task are:
You only need to worry about the components you'll use - and
sometimes, not even that. You'll almost always need to declare
your annotation sets; but beyond that, much of your task
configuration is optional. Here's a summary:
Component | When do you need it? |
---|---|
languages | always |
annotation sets | always (that is, if you need a task; for MATTransducer, MATScore and MATReport, you don't need a task at all) |
visual display | if you intend to do hand annotation or reconciliation or visual comparison |
automated engines | if you intend to do zoning or tokenization in conjunction with hand annotation, or if you otherwise want to use an automated or trainable tagger |
steps | if you intend to do hand or automated annotation |
workflows | if you intend to do hand or automated annotation |
workspace configurations | if you need to customize the behavior of workflows in workspaces; if you simply want to construct a workspace based on a workflow, nothing more needs to be specified |
scoring and comparison configurations | if you can't or don't want to use the default scoring and comparison configurations in MAT (see the section on scoring and comparison for details) |
Below, we'll talk about engines, steps, and workflows.
Engines implement the automated aspects of your processing steps.
These engines may be trainable, or not; they may be wrappers for
external tools, or they may be written directly in Python. Each
engine has an implementation which is a Python class. We provide a
number of useful engine implementations which you can use in your
task. If you want to define your own engines, you'll have to
consult the advanced
topics.
Here are the engine implementations that MAT provides "out of the
box":
Engine implementation name | Common step name | Description |
---|---|---|
MAT.PluginMgr.WholeZoneStep | zone | This engine assigns a single zone annotation, with label "zone" and attribute "region_type" with value "body", to the entire document. This engine has no options. |
MAT.JavaCarafe.CarafeTokenizationStep | tokenize | This engine runs the jCarafe tokenizer on the relevant document, generating token annotations with label "lex" in such a way that the zone boundaries are not crossed. The options for this engine are described here. |
MAT.JavaCarafe.CarafeTagStep | tag | This engine runs the jCarafe tagger, adding simple span annotations to the document. The models to be used with this engine are built using the MAT.JavaCarafe.CarafeModelBuilder model builder class. The options for this engine are described here. |
MAT.JavaCarafe.JCarafeMaxentClassifierTagStep | attribute_tag | This engine runs the jCarafe maximum entropy decoder, adding attribute values to the document. The models to be used with this engine are built using the MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder model builder class. The options for this engine are described here. |
MAT.PluginMgr.AlignStep | align | This engine is intended to work with documents which have been imported from other formats (e.g., XML inline) and whose content annotations may not align with token boundaries. This step aligns the content annotation boundaries with the token boundaries by expanding the annotations to the nearest token boundaries. This alignment is expected in the UI annotation tool (and, in fact, by many trainable tagging engines, including jCarafe). Insert a step with this engine in workflows which are intended to manage imported documents. The options for this engine are described immediately below. |
See the sample tasks for detailed
examples of how these engines are used in steps. See the jCarafe
engine reference for a description of how the presence of
only these engines limits the capabilities of MAT.
The run-time options these engines accept can be specified in
the task.xml file in the <run_settings> element of a
<step> in a <workflow>, or in the invocation of the
MAT engine. The one general-purpose step which has options is
MAT.PluginMgr.AlignStep:
Command line option | XML attribute | Value | Description |
---|---|---|---|
--ensure_gold | ensure_gold | a comma-delimited string of true step names | If present, mark the annotation sets for the named steps as gold-standard data (annotator = "GOLD_STANDARD", status = "reconciled"). The step names must be the names of the steps as they're defined alone, not as pretty names as referenced in the context of a workflow. |
--verbose | verbose | | Prints a detailed summary of the alignment decisions made. |
--half_verbose | half_verbose | | Prints a shortened summary of the alignment decisions made. |
--ignore_sets | ignore_sets | a comma-delimited string of annotation set or category names | By default, the aligner does not attempt to modify the boundaries of any of the predefined annotation categories (zone, admin, token). If you want to ignore other annotation categories or sets, use this option. |
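To illustrate the XML form of these options, here is a minimal sketch of a workflow that passes settings to an alignment step. It assumes that a step named align, backed by an engine whose step_config class is MAT.PluginMgr.AlignStep, is defined elsewhere in the task. The workflow name, the step name align and the set name nationality are hypothetical; the other step names are borrowed from the step definitions shown on this page.

```xml
<!-- Sketch only: 'align' is a hypothetical step name, assumed to be
     defined elsewhere with an engine based on MAT.PluginMgr.AlignStep. -->
<workflow name='Align imported documents'>
  <step name='whole_zone'/>
  <step name='carafe_tokenize'/>
  <step name='align'>
    <!-- ensure_gold takes true step names, not pretty names;
         ignore_sets takes annotation set or category names.
         'nationality' is a hypothetical set name. -->
    <run_settings ensure_gold='carafe_tag' ignore_sets='nationality'/>
  </step>
</workflow>
```

The same settings should also be available on the command line as --ensure_gold and --ignore_sets, per the table above.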
If an engine is trainable, it will be defined with an engine
implementation and also a model implementation, like so, perhaps
with a default model specified:
<engine name='carafe_tag_engine'>
  <default_model>default_model</default_model>
  <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
  <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
</engine>
The engine itself can be limited to specific languages, and can
be used as the engine in multiple steps.
Steps are atomic actions in your workflows. They describe how the engines are used (or not). There are four types of steps:
Any defined engine can be used in any of the last three steps;
usually, the engine associated with a mixed step is trainable, but
this isn't necessary.
Steps also specify the annotation sets
or categories that the step adds or modifies. The annotation
UI uses this information to determine which annotations to make
available to the annotator in each annotation step.
Here's a simple example of how to define engines and the steps
that use them:
<engines>
  <engine name='carafe_tag_engine'>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='whole_zone_engine'>
    <step_config class='MAT.PluginMgr.WholeZoneStep'/>
  </engine>
  <engine name='carafe_tokenize_engine'>
    <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  </engine>
</engines>
<steps>
  <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                   type='mixed' name='carafe_tag'/>
  <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                   type='auto' name='whole_zone'/>
  <annotation_step engine='carafe_tokenize_engine'
                   sets_added='category:token' type='auto'
                   name='carafe_tokenize'/>
  <annotation_step type='hand' name='correct'
                   sets_modified='category:content'/>
</steps>
You can find more details here and here.
When MATModelBuilder builds a
model, it has to know what trainable engine it's building the
model for (because the task can have multiple trainable engines)
and what set of annotations it's building a model for (because the
task can have multiple annotation sets). Because steps carry
information both about the engine and the annotation sets, MAT
uses the steps to provide the appropriate context to the model
builder. If the task has more than one step which has a trainable
engine, you must specify a step to the model builder when you
invoke it. Finally, the model builder needs to know what its
target language is; and if the task supports multiple languages,
you must specify this as well.
As we saw above, trainable engines can be assigned a default
model. The default model is a pathname; if it's a relative
pathname, it will be interpreted with respect to the directory the
task.xml file resides in. Because tasks can support multiple
languages and engines can support multiple trainable steps, the
path itself will be suffixed with the step and language code of
the context in which it's requested; e.g., if your default model
is declared to be "default_model", and the task declares that
English is available, and the model is trained in the context of
the "carafe_tag" step, the default model will be saved as
"default_model.carafe_tag_en" in your task directory.
In workspaces, the default model is ignored. The language is
established when the workspace is created, and the workspace knows
which documents have completed which steps, so all of these
issues are already handled for the user.
Once you have a set of steps to draw from, you can assemble them
into workflows. Four extremely common and obvious workflows are
If you create other, custom steps, you may have other workflows.
Within workflows, you can assign a "pretty name" to the step, for
instance:
<workflow name='Tokenless hand annotation'>
  <step pretty_name='zone' name='whole_zone'/>
  <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
</workflow>
As you can see here, you can also narrow the step type; although
carafe_tag is normally a mixed step, in the context of this
workflow only its hand annotation capability is used (so the name
is a bit of a misnomer in this usage).
The canonical step names are global to the task. In MAT 2.0,
these names were used in the documents to track the progress of
the annotation, but in MAT 3.0, they're not; instead, the names of
the sets modified and applied are used. This allows multiple
steps, with different engines, to impose the same effect on a
document (e.g., perhaps you have two different tagging engines
whose performance you're comparing). In this way, MAT 3.0 provides
a significantly cleaner division of labor between steps, engines
and annotation sets.
One of the confounding aspects of the 2.0 implementation of workflows and steps was that the task required a global "undo" order, to ensure that when steps were undone in a workflow, all appropriate steps in the task were undone (e.g., if you undid tag and zone in a workflow which doesn't contain tokenization, and some workflow in the task contained tokenization between zone and tag, tokenization was undone as well). This global undo order was impossible to maintain in 3.0, and as a result, it has been abandoned. If you undo steps in a workflow, only those steps will be undone, and as a result, your documents can end up in unusual states (e.g., tokenized but not zoned).
To compensate for this issue, in 3.0, workflows are not undoable by default; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).
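As a rough sketch, an undoable workflow that covers zoning, tokenization and content tagging might be declared as follows. The workflow name is hypothetical, and the attribute value 'yes' is an assumption; consult the documentation for the task XML for the accepted values.

```xml
<!-- Sketch only: the value 'yes' for undoable is an assumption,
     and the workflow name is hypothetical. -->
<workflow name='Full annotation' undoable='yes'>
  <step name='whole_zone'/>
  <step name='carafe_tokenize'/>
  <step name='carafe_tag'/>
</workflow>
```

Because this workflow contains every tagging step in the task, undoing steps within it should not leave documents in a partially processed state.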
As we've seen,
it's possible to have multiple content annotation sets. There's
nothing special about content annotations in MAT; they just happen
to be annotations which are associated with hand-annotatable
steps. So if, for instance, you want to partition your span
content annotations into two sets, as the Sample Relations
task does, all you need to do is define a step for each annotation
set you want to add:
<annotation_step engine='carafe_tag_engine' sets_added='entities'
                 type='mixed' name='entity_tag'/>
<annotation_step engine='carafe_tag_engine' sets_added='nationality'
                 type='mixed' name='nationality_tag'/>
Note that these steps have different names, but the same engine:
<engine name='carafe_tag_engine'>
  <default_model>default_model</default_model>
  <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
    <build_settings training_method='psa' max_iterations='6'/>
  </model_config>
  <model_config config_name='alt_model_build'
                class='MAT.JavaCarafe.CarafeModelBuilder'/>
  <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
</engine>
Remember, the default model gets a step and language suffix, so
if you build default models for each of the steps here, you'll end
up with two model files, "default_model.entity_tag_en" and
"default_model.nationality_tag_en" (assuming English is the sole
language declared).
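A workflow that applies both steps in sequence might look like the following sketch. The workflow name is hypothetical, and the whole_zone and carafe_tokenize steps are assumed to be defined in the same task, as in the earlier engine and step example.

```xml
<!-- Sketch only: the workflow name is hypothetical; whole_zone and
     carafe_tokenize are assumed to be defined in the same task. -->
<workflow name='Entities and nationalities'>
  <step name='whole_zone'/>
  <step name='carafe_tokenize'/>
  <step name='entity_tag'/>
  <step name='nationality_tag'/>
</workflow>
```

Since more than one step in this task has a trainable engine, you would need to name the step (entity_tag or nationality_tag) when invoking the model builder, as described above.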
On this page, we've described some of the most prominent
customizations: defining engines, steps and workflows (we've
discussed annotations
elsewhere). There are many other things you can declare in your
tasks:
For relevant examples of these, please consult "The sample tasks", "Creating a new task", and the
documentation for the task XML.