Tasks, engines, steps and workflows

Tasks

For the most part, you can't do anything substantial with the MAT toolkit without defining a task. We introduced tasks here. The fundamental components of a task are the ones listed in the summary table below.

You only need to worry about the components you'll use - and sometimes, not even that. You'll almost always need to declare your annotation sets; but beyond that, much of your task configuration is optional. Here's a summary:

Component | When do you need it?
languages | always
annotation sets | always (that is, if you need a task; for MATTransducer, MATScore and MATReport, you don't need a task at all)
visual display | if you intend to do hand annotation or reconciliation or visual comparison
automated engines | if you intend to do zoning or tokenization in conjunction with hand annotation, or if you otherwise want to use an automated or trainable tagger
steps | if you intend to do hand or automated annotation
workflows | if you intend to do hand or automated annotation
workspace configurations | if you need to customize the behavior of workflows in workspaces; if you simply want to construct a workspace based on a workflow, nothing more needs to be specified
scoring and comparison configurations | if you can't or don't want to use the default scoring and comparison configurations in MAT (see the section on scoring and comparison for details)

Below, we'll talk about engines, steps, and workflows.

Engines

Engines implement the automated aspects of your processing steps. These engines may be trainable, or not; they may be wrappers for external tools, or they may be written directly in Python. Each engine has an implementation which is a Python class. We provide a number of useful engine implementations which you can use in your task. If you want to define your own engines, you'll have to consult the advanced topics.

Here are the engine implementations that MAT provides "out of the box":

Engine implementation name | Common step name | Description
MAT.PluginMgr.WholeZoneStep | zone | This engine assigns a single zone annotation with label "zone" and attribute "region_type" with value "body", to the entire document. This engine has no options.
MAT.JavaCarafe.CarafeTokenizationStep | tokenize | This engine runs the jCarafe tokenizer on the relevant document, generating token annotations with label "lex" in such a way that the zone boundaries are not crossed. The options for this engine are described here.
MAT.JavaCarafe.CarafeTagStep | tag | This engine runs the jCarafe tagger, adding content tags to the document. The options for this engine are described here.
MAT.PluginMgr.AlignStep | align | This engine is intended to work with documents which have been imported from other formats (e.g., XML inline), and which have content annotations which may not align with token boundaries. This step aligns the content annotation boundaries with the token boundaries by expanding the annotations to the nearest token boundaries. This alignment is expected in the UI annotation tool (and, in fact, by many trainable tagging engines, including jCarafe). Insert a step with this engine in any workflow which is intended to manage imported documents. The options for this engine are described immediately below.
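
The boundary expansion described for MAT.PluginMgr.AlignStep can be sketched roughly as follows. This is only an illustration of the idea, not MAT's implementation: the function name align_span, the half-open character offsets, and the representation of tokens as (start, end) pairs are all assumptions made for this example.

```python
def align_span(start, end, tokens):
    """Expand the half-open span [start, end) to the enclosing token boundaries.

    tokens is a list of (tok_start, tok_end) pairs. This layout is hypothetical
    and is not MAT's internal annotation representation.
    """
    new_start, new_end = start, end
    for tok_start, tok_end in tokens:
        # The span starts strictly inside this token: pull the start back.
        if tok_start < start < tok_end:
            new_start = tok_start
        # The span ends strictly inside this token: push the end forward.
        if tok_start < end < tok_end:
            new_end = tok_end
    return new_start, new_end


# Illustrative "lex" token offsets for the text "Johns Hopkins campus".
tokens = [(0, 5), (6, 13), (14, 20)]

# A content annotation imported with offsets that cut through two tokens.
aligned = align_span(2, 9, tokens)
```

In this sketch, an edge that already sits on a token boundary, or that falls in the whitespace between tokens, is left where it is.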

See the sample tasks for detailed examples of how these engines are used in steps.

The run-time options these engines accept can be specified in the task.xml file, in the <run_settings> element of a <step> in a <workflow>, or in the invocation of the MAT engine. The one general-purpose step which has options is MAT.PluginMgr.AlignStep:

Command line option | XML attribute | Value | Description
--ensure_gold | ensure_gold | a comma-delimited string of true step names | If present, mark the annotation sets for the named steps as gold-standard data (annotator = "GOLD_STANDARD", status = "reconciled"). The step names must be the names of the steps as they're defined alone, not as pretty names as referenced in the context of a workflow.
--verbose | verbose | (none) | Prints a detailed summary of the alignment decisions made.
--half_verbose | half_verbose | (none) | Prints a shortened summary of the alignment decisions made.
--ignore_sets | ignore_sets | a comma-delimited string of annotation set or category names | By default, the aligner does not attempt to modify the boundaries of any of the predefined annotation categories (zone, admin, token). If you want to ignore other annotation categories or sets, use this option.
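
To make the task.xml placement of these options concrete, here is a small sketch that reads them back out of a workflow fragment with Python's standard xml.etree.ElementTree module. The workflow name, the step name "align", the ignore_sets value, and the helper run_settings_for are all hypothetical; only the <workflow>/<step>/<run_settings> nesting and the attribute name come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow fragment; the names and values are made up for
# illustration and are not taken from a real MAT task.
WORKFLOW_XML = """
<workflow name='Import and align'>
  <step name='align'>
    <run_settings ignore_sets='entities,nationality'/>
  </step>
</workflow>
"""


def run_settings_for(workflow_xml, step_name):
    """Return the <run_settings> attributes of the named step as a dict."""
    workflow = ET.fromstring(workflow_xml)
    for step in workflow.findall("step"):
        if step.get("name") == step_name:
            settings = step.find("run_settings")
            return dict(settings.attrib) if settings is not None else {}
    return {}


settings = run_settings_for(WORKFLOW_XML, "align")
# ignore_sets is a comma-delimited string, so split it into set names.
ignored = settings.get("ignore_sets", "").split(",")
```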

If an engine is trainable, it will be defined with an engine implementation and also a model implementation, like so, perhaps with a default model specified:

    <engine name='carafe_tag_engine'>
      <default_model>default_model</default_model>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>

The engine itself can be limited to specific languages, and can be used as the engine in multiple steps.

Steps

Steps are atomic actions in your workflows. They describe how the engines are used (or not). There are four types of steps:

Any defined engine can be used in any of the last three steps; usually, the engine associated with a mixed step is trainable, but this isn't necessary.

Steps also specify the annotation sets or categories that the step adds or modifies. The annotation UI uses this information to determine which annotations to make available to the annotator in each annotation step.

Here's a simple example of how to define engines and the steps that use them:

    <engines>
      <engine name='carafe_tag_engine'>
        <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
        <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
      </engine>
      <engine name='whole_zone_engine'>
        <step_config class='MAT.PluginMgr.WholeZoneStep'/>
      </engine>
      <engine name='carafe_tokenize_engine'>
        <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
      </engine>
    </engines>
    <steps>
      <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                       type='mixed' name='carafe_tag'/>
      <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                       type='auto' name='whole_zone'/>
      <annotation_step engine='carafe_tokenize_engine'
                       sets_added='category:token' type='auto'
                       name='carafe_tokenize'/>
      <annotation_step type='hand' name='correct'
                       sets_modified='category:content'/>
    </steps>
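
The relationship between steps and engines in a fragment like this can be checked with a few lines of Python. The sketch below is not part of MAT; the wrapper element <task_fragment> and the function step_engine_classes are invented for this example, and it simply resolves each step's engine attribute to the engine's step_config class.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the engines and steps, wrapped in a hypothetical root
# element so that it parses as a single XML document.
TASK_FRAGMENT = """
<task_fragment>
  <engines>
    <engine name='carafe_tag_engine'>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>
    <engine name='whole_zone_engine'>
      <step_config class='MAT.PluginMgr.WholeZoneStep'/>
    </engine>
  </engines>
  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                     type='mixed' name='carafe_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
    <annotation_step type='hand' name='correct'
                     sets_modified='category:content'/>
  </steps>
</task_fragment>
"""


def step_engine_classes(xml_text):
    """Map each step name to its engine's step_config class (None if no engine)."""
    root = ET.fromstring(xml_text)
    engines = {
        engine.get("name"): engine.find("step_config").get("class")
        for engine in root.iter("engine")
    }
    resolved = {}
    for step in root.iter("annotation_step"):
        engine_name = step.get("engine")
        if engine_name is None:
            # Steps such as 'correct' above declare no engine.
            resolved[step.get("name")] = None
        elif engine_name not in engines:
            raise ValueError("step %r names an undefined engine %r"
                             % (step.get("name"), engine_name))
        else:
            resolved[step.get("name")] = engines[engine_name]
    return resolved


classes = step_engine_classes(TASK_FRAGMENT)
```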

You can find more details here and here.

Steps and model training

When MATModelBuilder builds a model, it has to know what trainable engine it's building the model for (because the task can have multiple trainable engines) and what set of annotations it's building a model for (because the task can have multiple annotation sets). Because steps carry information both about the engine and the annotation sets, MAT uses the steps to provide the appropriate context to the model builder. If the task has more than one step which has a trainable engine, you must specify a step to the model builder when you invoke it. Finally, the model builder needs to know what its target language is; and if the task supports multiple languages, you must specify this as well.
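
The disambiguation rule in this paragraph can be sketched as follows. The function resolve_training_context and its arguments are hypothetical and do not reflect how MATModelBuilder is actually implemented or invoked; the sketch only encodes the rule that a step or language may be omitted when exactly one candidate exists.

```python
def resolve_training_context(trainable_steps, languages, step=None, language=None):
    """Pick the step and language to train for, insisting on an explicit
    choice whenever the task offers more than one candidate."""

    def pick(kind, candidates, requested):
        if requested is None:
            if len(candidates) != 1:
                raise ValueError("task has %d %ss; you must specify one"
                                 % (len(candidates), kind))
            return candidates[0]
        if requested not in candidates:
            raise ValueError("unknown %s: %r" % (kind, requested))
        return requested

    return (pick("trainable step", trainable_steps, step),
            pick("language", languages, language))


# One trainable step and one language: nothing needs to be specified.
single = resolve_training_context(["carafe_tag"], ["en"])

# Two trainable steps: the step must be named explicitly.
explicit = resolve_training_context(["entity_tag", "nationality_tag"], ["en"],
                                    step="nationality_tag")
```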

As we saw above, trainable engines can be assigned a default model. The default model is a pathname; if it's a relative pathname, it will be interpreted with respect to the directory the task.xml file resides in. Because tasks can support multiple languages and engines can support multiple trainable steps, the path itself will be suffixed with the step and language code of the context in which it's requested; e.g., if your default model is declared to be "default_model", and the task declares that English is available, and the model is trained in the context of the "carafe_tag" step, the default model will be saved as "default_model.carafe_tag_en" in your task directory.
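
The naming rule for default models can be written out as a short function. This is a sketch of the rule as described here, not code from MAT; the function name default_model_path and the task directory "my_task" are assumptions.

```python
import os.path


def default_model_path(task_dir, default_model, step_name, language_code):
    """Build the file name a default model would be saved under.

    Relative default_model values are resolved against the directory that
    holds task.xml, and the step name and language code are then appended.
    """
    path = default_model
    if not os.path.isabs(path):
        path = os.path.join(task_dir, path)
    return "{}.{}_{}".format(path, step_name, language_code)


# "my_task" stands in for the directory containing task.xml.
model_file = default_model_path("my_task", "default_model", "carafe_tag", "en")
```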

In workspaces, the default model is ignored. The language is established when the workspace is created, and the workspace knows which documents have completed which steps, so all of these issues are already handled for the user.

Workflows

Once you have a set of steps to draw from, you can assemble them into workflows. Four extremely common and obvious workflows are

If you create other, custom steps, you may have other workflows.

Within workflows, you can assign a "pretty name" to the step, for instance:

    <workflow name='Tokenless hand annotation'>
      <step pretty_name='zone' name='whole_zone'/>
      <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
    </workflow>

As you can see here, you can also narrow the step type; although carafe_tag is normally a mixed step, in the context of this workflow only its hand annotation capability is used (so the name is a bit of a misnomer in this usage).
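
One way to picture the narrowing is as a small rule that combines the type declared on the step with the optional type given in the workflow. The function effective_type below is hypothetical, and it assumes that a mixed step can be narrowed to either "hand" or "auto"; the text above only confirms the "hand" case.

```python
def effective_type(declared_type, workflow_type=None):
    """Return the step type in force inside a workflow.

    declared_type is the type on the step definition; workflow_type is the
    optional override on the workflow's <step> element.
    """
    if workflow_type is None or workflow_type == declared_type:
        return declared_type
    # Assumption: only a mixed step can be narrowed, to hand or to auto.
    if declared_type == "mixed" and workflow_type in ("hand", "auto"):
        return workflow_type
    raise ValueError("cannot narrow a %r step to %r"
                     % (declared_type, workflow_type))


# carafe_tag is declared mixed, but the workflow uses it for hand annotation.
tag_type = effective_type("mixed", "hand")

# whole_zone has no override, so it keeps its declared type.
zone_type = effective_type("auto")
```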

The canonical step names are global to the task. In MAT 2.0, these names were used in the documents to track the progress of the annotation, but in MAT 3.0, they're not; instead, the names of the sets modified and applied are used. This allows multiple steps, with different engines, to impose the same effect on a document (e.g., perhaps you have two different tagging engines whose performance you're comparing). In this way, MAT 3.0 provides a significantly cleaner division of labor between steps, engines and annotation sets.
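
The consequence of tracking sets rather than step names can be illustrated with a toy model. Nothing here is MAT's actual bookkeeping: the function step_is_done, the dictionary layout, and the second tagging step "other_tag" are invented for the example.

```python
def step_is_done(step_sets, document_sets):
    """A step counts as done when every set it adds is recorded on the document."""
    return set(step_sets) <= set(document_sets)


# Two steps with different engines that add the same annotation sets.
# 'other_tag' is a hypothetical second tagger used for comparison.
steps = {
    "carafe_tag": ["category:content"],
    "other_tag": ["category:content"],
}

# Sets recorded on a document after zoning, tokenization and one tagging step.
document_sets = {"category:zone", "category:token", "category:content"}

done = {name: step_is_done(sets, document_sets) for name, sets in steps.items()}
```

Because the document records sets rather than step names, both tagging steps appear complete in this model, whichever of them actually ran.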

Undo

One of the confounding aspects of the 2.0 implementation of workflows and steps was that the task required a global "undo" order, in order to ensure that when steps were undone in a workflow, all appropriate steps in the task were undone (e.g., if you undid tag and zone in a workflow which doesn't contain tokenization, and some workflow in the task contained tokenization between zone and tag, tokenization was undone as well). This global undo order was impossible to maintain in 3.0, and as a result, it has been abandoned. If you undo steps in a workflow, only those steps will be undone, and as a result, your documents can end up in unusual states (e.g., tokenized but not zoned). In order to compensate for this issue, in 3.0, workflows, by default, are not undoable; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).
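
The "unusual states" mentioned above are easy to reproduce in a toy model. The sketch below is an assumption-laden illustration, not MAT's undo logic: it treats a document's state as the collection of annotation sets that have been applied, and treats undoing a step as removing the sets that step adds.

```python
def undo_workflow(document_sets, workflow_steps, step_sets):
    """Undo only the steps in the given workflow and return what remains."""
    remaining = set(document_sets)
    for step in workflow_steps:
        remaining -= set(step_sets[step])
    return remaining


# The sets each step adds, following the step definitions shown earlier.
step_sets = {
    "whole_zone": ["category:zone"],
    "carafe_tokenize": ["category:token"],
    "carafe_tag": ["category:content"],
}

# A fully processed document, and a workflow that has no tokenization step.
document_sets = {"category:zone", "category:token", "category:content"}
tokenless_workflow = ["whole_zone", "carafe_tag"]

remaining = undo_workflow(document_sets, tokenless_workflow, step_sets)
```

Undoing the tokenless workflow leaves only the token set behind, which is the "tokenized but not zoned" state described in the text.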

Steps and workflows with multiple content annotation sets

As we've seen, it's possible to have multiple content annotation sets. There's nothing special about content annotations in MAT; they just happen to be annotations which are associated with hand-annotatable steps. So if, for instance, you want to partition your span content annotations into two sets, as the Sample Relations task does, all you need to do is define a step for each annotation set you want to add:

    <annotation_step engine='carafe_tag_engine' sets_added='entities'
                     type='mixed' name='entity_tag'/>
    <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                     type='mixed' name='nationality_tag'/>

Note that these steps have different names, but the same engine:

    <engine name='carafe_tag_engine'>
      <default_model>default_model</default_model>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
        <build_settings training_method='psa' max_iterations='6'/>
      </model_config>
      <model_config config_name='alt_model_build'
                    class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>

Remember, the default model gets a step and language suffix, so if you build default models for each of the steps here, you'll end up with two model files, "default_model.entity_tag_en" and "default_model.nationality_tag_en" (assuming English is the sole language declared).

Other things you may find in tasks

In this page, we've described some of the most prominent customizations: defining engines, steps and workflows (we've discussed annotations elsewhere). There are many other things you can declare in your tasks:

For relevant examples of these, please consult "The sample tasks", "Creating a new task", and the documentation for the task XML.