Tasks, engines, steps and workflows

Tasks
Engines
Steps
Steps and model training
Workflows
Undo
Steps and workflows with multiple content annotation sets
Other things you may find in tasks

Tasks

For the most part, you can't do anything substantial with the MAT toolkit without defining a task. We introduced tasks here. The fundamental components of a task are:

the languages that this task is defined for
the annotation sets which you can add using this task
the visual display of these annotations
the automated engines which you can use to add these annotations
the steps which you can apply (which can make use of automated engines if necessary)
the workflows, which consist of these steps
the workspace configurations which you can use with workflows to customize their behavior in workspaces
the scoring and comparison configurations for the annotations which drive the scoring, comparison and reconciliation tools

You only need to worry about the components you'll use - and sometimes, not even that. You'll almost always need to declare your annotation sets; but beyond that, much of your task configuration is optional. Here's a summary:

Component	When do you need it?
languages	always
annotation sets	always (that is, if you need a task; for MATTransducer, MATScore and MATReport, you don't need a task at all)
visual display	if you intend to do hand annotation or reconciliation or visual comparison
automated engines	if you intend to do zoning or tokenization in conjunction with hand annotation, or if you otherwise want to use an automated or trainable tagger
steps	if you intend to do hand or automated annotation
workflows	if you intend to do hand or automated annotation
workspace configurations	if you need to customize the behavior of workflows in workspaces; if you simply want to construct a workspace based on a workflow, nothing more needs to be specified
scoring and comparison configurations	if you can't or don't want to use the default scoring and comparison configurations in MAT (see the section on scoring and comparison for details)

Below, we'll talk about engines, steps, and workflows.

Engines

Engines implement the automated aspects of your processing steps. These engines may be trainable, or not; they may be wrappers for external tools, or they may be written directly in Python. Each engine has an implementation which is a Python class. We provide a number of useful engine implementations which you can use in your task. If you want to define your own engines, you'll have to consult the advanced topics.

Here are the engine implementations that MAT provides "out of the box":

Engine implementation name	common step name	Description
MAT.PluginMgr.WholeZoneStep	zone	This engine assigns a single zone annotation with label "zone" and attribute "region_type" with value "body", to the entire document. This engine has no options.
MAT.JavaCarafe.CarafeTokenizationStep	tokenize	This engine runs the jCarafe tokenizer on the relevant document, generating token annotations with label "lex" in such a way that the zone boundaries are not crossed. The options for this engine are described here.
MAT.JavaCarafe.CarafeTagStep	tag	This engine runs the jCarafe tagger, adding simple span annotations to the document. The models to be used with this engine are built using the MAT.JavaCarafe.CarafeModelBuilder model builder class. The options for this engine are described here.
MAT.JavaCarafe.JCarafeMaxentClassifierTagStep	attribute_tag	This engine runs the jCarafe maximum entropy decoder, adding attribute values to the document. The models to be used with this engine are built using the MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder model builder class. The options for this engine are described here.
MAT.PluginMgr.AlignStep	align	This engine is intended to work with documents which have been imported from other formats (e.g., XML inline), which have content annotations which may not align with token boundaries. This step aligns the content annotation boundaries with the token boundaries by expanding the annotations to the nearest token boundaries. This alignment is expected in the UI annotation tool (and, in fact, by many trainable tagging engines, including jCarafe). Insert a step with this engine in your workflows which are intended to manage imported documents. The options for this engine are described immediately below.

See the sample tasks for detailed examples of how these engines are used in steps. See the jCarafe engine reference for a description of how the presence of only these engines limits the capabilities of MAT.

The run-time options these engines accept can be specified in the task.xml file in the <run_settings> element of a <step> in a <workflow>, or in the invocation of the MAT engine. The one general-purpose step which has options is MAT.PluginMgr.AlignStep:

Command line option	XML attribute	Value	Description
--ensure_gold	ensure_gold	a comma-delimited string of true step names	If present, mark the annotation sets for the named steps as gold-standard data (annotator = "GOLD_STANDARD", status = "reconciled"). The step names must be the names of the steps as they're defined alone, not as pretty names as referenced in the context of a workflow.
--verbose	verbose		Prints a detailed summary of the alignment decisions made.
--half_verbose	half_verbose		Prints a shortened summary of the alignment decisions made.
--ignore_sets	ignore_sets	a comma-delimited string of annotation set or category names	By default, the aligner does not attempt to modify the boundaries of any of the predefined annotation categories (zone, admin, token). If you want to ignore other annotation categories or sets, use this option.
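
To make the alignment behavior concrete, here is a minimal Python sketch of expanding an annotation to the boundaries of the tokens it overlaps. It is not MAT's implementation: the function name align_to_tokens and the (start, end) tuple representation are assumptions made for illustration, and it assumes that "expanding" means growing the span outward until it covers every token it touches.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def align_to_tokens(annotation: Span, tokens: List[Span]) -> Span:
    """Expand an annotation outward to cover every token it overlaps.

    Hypothetical helper, not part of MAT. The span is never shrunk, and it
    is returned unchanged if it overlaps no token at all.
    """
    start, end = annotation
    # Tokens that share at least one character with the annotation.
    overlapping = [(s, e) for (s, e) in tokens if s < end and e > start]
    if not overlapping:
        return annotation
    new_start = min(start, min(s for s, _ in overlapping))
    new_end = max(end, max(e for _, e in overlapping))
    return (new_start, new_end)


# "John Smith" tokenized as two tokens: [0, 4) and [5, 10).
tokens = [(0, 4), (5, 10)]
# An imported annotation covering "ohn Smit" is widened to "John Smith".
aligned = align_to_tokens((1, 9), tokens)
print(aligned)
```

An annotation that already sits on token boundaries, such as (0, 4) here, comes back unchanged.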

If an engine is trainable, it will be defined with an engine implementation and also a model implementation, like so, perhaps with a default model specified:

      <engine name='carafe_tag_engine'>
        <default_model>default_model</default_model>
        <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
        <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
      </engine>

The engine itself can be limited to specific languages, and can be used as the engine in multiple steps.

Steps

Steps are atomic actions in your workflows. They describe how the engines are used (or not). There are four types of steps:

a hand annotation step, which has no engine associated with it and can only be applied in the UI
an automated step, which has an engine associated with it which must be used (i.e., you can't add or modify these annotations by hand)
a correction step, which has an engine associated with it whose output can be corrected if you're in the UI
a mixed step, which has an engine associated with it, in which you can either do hand annotation in the UI or automated annotation followed by correction

Any defined engine can be used in any of the last three steps; usually, the engine associated with a mixed step is trainable, but this isn't necessary.
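
The differences among the four step types can be summarized as a small capability table. The Python sketch below is only a reading aid: the class, the field names and the dictionary keys are invented for this illustration, and the keys are descriptive labels rather than a statement of the exact strings MAT accepts in a step's type attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepCapabilities:
    has_engine: bool          # an automated engine is attached to the step
    hand_from_scratch: bool   # annotator may annotate without running the engine
    hand_correction: bool     # annotator may correct the engine's output in the UI


# Keys are labels for this sketch only (assumption), taken from the prose above.
STEP_TYPES = {
    "hand": StepCapabilities(False, True, False),
    "auto": StepCapabilities(True, False, False),
    "correction": StepCapabilities(True, False, True),
    "mixed": StepCapabilities(True, True, True),
}

for name, caps in STEP_TYPES.items():
    print(name, caps)
```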

Steps also specify the annotation sets or categories that the step adds or modifies. The annotation UI uses this information to determine which annotations to make available to the annotator in each annotation step.

Here's a simple example of how to define engines and the steps that use them:

    <engines>
      <engine name='carafe_tag_engine'>
        <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
        <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
      </engine>
      <engine name='whole_zone_engine'>
        <step_config class='MAT.PluginMgr.WholeZoneStep'/>
      </engine>
      <engine name='carafe_tokenize_engine'>
        <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
      </engine>
    </engines>
    <steps>
      <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                       type='mixed' name='carafe_tag'/>
      <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                       type='auto' name='whole_zone'/>
      <annotation_step engine='carafe_tokenize_engine'
                       sets_added='category:token' type='auto'
                       name='carafe_tokenize'/>
      <annotation_step type='hand' name='correct'
                       sets_modified='category:content'/>
    </steps>

You can find more details here and here.

Steps and model training

When MATModelBuilder builds a model, it has to know what trainable engine it's building the model for (because the task can have multiple trainable engines) and what set of annotations it's building a model for (because the task can have multiple annotation sets). Because steps carry information both about the engine and the annotation sets, MAT uses the steps to provide the appropriate context to the model builder. If the task has more than one step which has a trainable engine, you must specify a step to the model builder when you invoke it. Finally, the model builder needs to know what its target language is; and if the task supports multiple languages, you must specify this as well.
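
The rule about when a step must be named can be sketched in Python. This is an illustration under stated assumptions, not MATModelBuilder's code: the functions trainable_steps and choose_training_step are hypothetical, the sketch treats an engine as trainable when it declares a model_config element, and the XML fragment is wrapped in a task root element only so that it parses as a single document.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Two steps share one trainable engine; the zoning engine is not trainable.
TASK_XML = """
<task>
  <engines>
    <engine name='carafe_tag_engine'>
      <model_config class='MAT.JavaCarafe.CarafeModelBuilder'/>
      <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
    </engine>
    <engine name='whole_zone_engine'>
      <step_config class='MAT.PluginMgr.WholeZoneStep'/>
    </engine>
  </engines>
  <steps>
    <annotation_step engine='carafe_tag_engine' sets_added='entities'
                     type='mixed' name='entity_tag'/>
    <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                     type='mixed' name='nationality_tag'/>
    <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                     type='auto' name='whole_zone'/>
  </steps>
</task>
"""


def trainable_steps(task_xml: str) -> List[str]:
    """Return the names of steps whose engine declares a model_config."""
    root = ET.fromstring(task_xml)
    trainable_engines = {
        engine.get("name")
        for engine in root.iter("engine")
        if engine.find("model_config") is not None
    }
    return [
        step.get("name")
        for step in root.iter("annotation_step")
        if step.get("engine") in trainable_engines
    ]


def choose_training_step(task_xml: str, requested: Optional[str] = None) -> str:
    """Pick the step to train for, insisting on an explicit choice if ambiguous."""
    candidates = trainable_steps(task_xml)
    if requested is not None:
        if requested not in candidates:
            raise ValueError(f"{requested!r} has no trainable engine")
        return requested
    if len(candidates) != 1:
        raise ValueError(f"specify one of these steps: {candidates}")
    return candidates[0]


print(trainable_steps(TASK_XML))
print(choose_training_step(TASK_XML, requested="entity_tag"))
```

Calling choose_training_step(TASK_XML) without a step raises an error here, because two steps are candidates.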

As we saw above, trainable engines can be assigned a default model. The default model is a pathname; if it's a relative pathname, it will be interpreted with respect to the directory the task.xml file resides in. Because tasks can support multiple languages and engines can support multiple trainable steps, the path itself will be suffixed with the step and language code of the context in which it's requested; e.g., if your default model is declared to be "default_model", and the task declares that English is available, and the model is trained in the context of the "carafe_tag" step, the default model will be saved as "default_model.carafe_tag_en" in your task directory.
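
The naming rule for default models can be written out as a short Python sketch. The helper default_model_path and the example directory are hypothetical; the sketch simply follows the description above, resolving a relative path against the task directory and appending the step name and language code.

```python
import os


def default_model_path(task_dir: str, default_model: str, step: str, lang_code: str) -> str:
    """Build the suffixed default-model path described above.

    Hypothetical helper, not MAT's API. Relative paths are resolved against
    the directory containing task.xml; the suffix is ".<step>_<language code>".
    """
    base = default_model
    if not os.path.isabs(base):
        base = os.path.join(task_dir, base)
    return f"{base}.{step}_{lang_code}"


# "/tasks/ne" is a made-up task directory for the example.
print(default_model_path("/tasks/ne", "default_model", "carafe_tag", "en"))
```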

In workspaces, all of these issues are already handled for the user: the default model is ignored, the language is established when the workspace is created, and the workspace knows which documents have completed which steps.

Workflows

Once you have a set of steps to draw from, you can assemble them into workflows. Three extremely common and obvious workflows are

a workflow that zones a document, tokenizes it, and then provides an opportunity for mixed-initiative or hand annotation (called "Demo" in the "Named Entity" task)
a workflow for hand annotation without tokenization (called "Tokenless hand annotation" in the "Named Entity" task)
a workflow which supports a final phase of human review and hand-correction (called "Review/repair" in the "Named Entity" task)

If you create other, custom steps, you may have other workflows.

Within workflows, you can assign a "pretty name" to the step, for instance:

      <workflow name='Tokenless hand annotation'>
        <step pretty_name='zone' name='whole_zone'/>
        <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
      </workflow>

As you can see here, you can also narrow the step type; although carafe_tag is normally a mixed step, in the context of this workflow only its hand annotation capability is used (so the name is a bit of a misnomer in this usage).
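
One way to picture how a workflow's step entries combine with the task-wide step declarations is the Python sketch below. It is an illustration rather than MAT's loader: resolve_workflow and DECLARED_TYPES are hypothetical names, and falling back to the canonical name when no pretty name is given is an assumption.

```python
import xml.etree.ElementTree as ET

# Step types as declared in the task's <steps> section in the earlier example.
DECLARED_TYPES = {"whole_zone": "auto", "carafe_tag": "mixed"}

WORKFLOW_XML = """
<workflow name='Tokenless hand annotation'>
  <step pretty_name='zone' name='whole_zone'/>
  <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
</workflow>
"""


def resolve_workflow(workflow_xml, declared_types):
    """Return (display name, canonical name, effective type) for each step."""
    workflow = ET.fromstring(workflow_xml)
    resolved = []
    for step in workflow.findall("step"):
        name = step.get("name")
        # Assumption: with no pretty_name, the canonical name is displayed.
        display = step.get("pretty_name", name)
        # A type given in the workflow narrows the declared step type.
        step_type = step.get("type", declared_types[name])
        resolved.append((display, name, step_type))
    return resolved


for row in resolve_workflow(WORKFLOW_XML, DECLARED_TYPES):
    print(row)
```

Here the second step is displayed as "hand tag" and runs as a hand step, while the zoning step keeps its declared auto type.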

The canonical step names are global to the task. In MAT 2.0, these names were used in the documents to track the progress of the annotation, but in MAT 3.0, they're not; instead, the names of the sets modified and applied are used. This allows multiple steps, with different engines, to impose the same effect on a document (e.g., perhaps you have two different tagging engines whose performance you're comparing). In this way, MAT 3.0 provides a significantly cleaner division of labor between steps, engines and annotation sets.
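
The consequence of tracking sets instead of step names can be sketched in a few lines of Python. This is a simplified model, not MAT's document format: the step table, the step name other_tag and the function step_is_done are made up for illustration, and the sketch assumes a step counts as done once every set it adds is recorded in the document.

```python
# Hypothetical step table: two different tagging steps add the same set.
STEP_SETS_ADDED = {
    "whole_zone": {"zone"},
    "carafe_tokenize": {"token"},
    "carafe_tag": {"content"},
    "other_tag": {"content"},   # a second, made-up tagger used for comparison
}


def step_is_done(step, sets_in_document):
    """A step counts as done once every set it adds is recorded in the document."""
    return STEP_SETS_ADDED[step] <= sets_in_document


# A document that was zoned, tokenized and tagged by carafe_tag.
recorded = {"zone", "token", "content"}
print(step_is_done("carafe_tag", recorded))
print(step_is_done("other_tag", recorded))
```

Both calls report the step as done, because the document records the content set rather than the name of the step that produced it.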

Undo

One of the confounding aspects of the 2.0 implementation of workflows and steps was that the task required a global "undo" order, in order to ensure that when steps were undone in a workflow, all appropriate steps in the task were undone (e.g., if you undid tag and zone in a workflow which doesn't contain tokenization, and some workflow in the task contained tokenization between zone and tag, tokenization was undone as well). This global undo order was impossible to maintain in 3.0, and as a result, it has been abandoned. If you undo steps in a workflow, only those steps will be undone, and as a result, your documents can end up in unusual states (e.g., tokenized but not zoned). In order to compensate for this issue, in 3.0, workflows, by default, are not undoable; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).
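
The "tokenized but not zoned" state mentioned above can be reproduced with a toy model in Python. The function undo_workflow and the set names are hypothetical; the sketch only assumes that undoing a workflow removes the sets handled by that workflow's own steps and nothing else.

```python
def undo_workflow(sets_in_document, workflow_sets):
    """Undo only the sets handled by the workflow's own steps (3.0 behavior)."""
    return sets_in_document - set(workflow_sets)


# Document was zoned, tokenized and tagged by some other workflow.
document_sets = {"zone", "token", "content"}
# This workflow has only a zone step and a tag step; it has no tokenization.
tokenless_workflow = ["zone", "content"]

remaining = undo_workflow(document_sets, tokenless_workflow)
print(sorted(remaining))  # prints ['token']: tokens survive without their zones
```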

Steps and workflows with multiple content annotation sets

As we've seen, it's possible to have multiple content annotation sets. There's nothing special about content annotations in MAT; they just happen to be annotations which are associated with hand-annotatable steps. So if, for instance, you want to partition your span content annotations into two sets, as the Sample Relations task does, all you need to do is define a step for each annotation set you want to add:

      <annotation_step engine='carafe_tag_engine' sets_added='entities'
                       type='mixed' name='entity_tag'/>
      <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                       type='mixed' name='nationality_tag'/>

Note that these steps have different names, but the same engine:

      <engine name='carafe_tag_engine'>
        <default_model>default_model</default_model>
        <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
          <build_settings training_method='psa' max_iterations='6'/>
        </model_config>
        <model_config config_name='alt_model_build'
                      class='MAT.JavaCarafe.CarafeModelBuilder'/>
        <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
      </engine>

Remember, the default model gets a step and language suffix, so if you build default models for each of the steps here, you'll end up with two model files, "default_model.entity_tag_en" and "default_model.nationality_tag_en" (assuming English is the sole language declared).

Other things you may find in tasks

In this page, we've described some of the most prominent customizations: defining engines, steps and workflows (we've discussed annotations elsewhere). There are many other things you can declare in your tasks:

You must declare your languages in your task.xml file. This covers the language name, language code, whether it's right to left or left to right, and what punctuation characters are used to terminate inferred tokens in certain cases where no tokens are available.
You can provide specialized Javascript and CSS code for the MAT UI. These customizations can be as simple as providing a custom editor for an annotation attribute, or as complex as a whole suite of overrides for the core UI behavior. These customizations are complex, and we will leave them undocumented in this release.
You can specify default settings for building models in the task.xml file. These settings are essentially the flags described in the documentation for MATModelBuilder.
You can declare settings for the operations in your workspaces in the task.xml file. These settings are essentially the flags described in the documentation for MATEngine and MATModelBuilder (depending on which one the operation uses).
You can customize how annotations are compared for scoring and comparison.
You can define custom steps in Python, and refer to them in the task.xml file. For hints on how to do this, see the section "Creating Your Own Steps" in the advanced task customization docs.
You can define new workspace folders, and refer to them and customize their behavior in the task.xml file. These customizations are still evolving, and we will leave them undocumented in this release.
You can customize the documentation that's visible via the MAT UI. These customizations are very complex, and we will leave them undocumented in this release.

For relevant examples of these, please consult "The sample tasks", "Creating a new task", and the documentation for the task XML.