The task you use will contain a number of different sorts of
We'll talk about workspaces in a
bit, and some of these others you don't need to know about yet.
Right now, we're going to talk about engines, steps, and workflows.
Some of what you want to do in MAT requires an engine.
These engines are how MAT adds annotations for you, rather than
you adding them yourself by hand. Sometimes these engines are
trainable, and you can create models for them, as we did in tutorial 2; some of them are not,
like the engines that added zones and tokens for us in tutorial 1. Each task declares the
engines it needs, and tells MAT where to find the implementation
of that engine (and the model builder for that engine, if it's trainable).
You can find a list of the engines that MAT comes with here and here.
In MAT, each workflow consists of a series of steps.
These steps are global in the task; the workflows put subsets of
them in fixed orders, depending on your activity. In MAT 3.0, you
might encounter the following workflows:
There will be others, but these are the important ones.
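To make the relationship between steps and workflows concrete, here is a minimal Python sketch of the idea described above: steps are declared once for the whole task, and each workflow arranges a subset of them in a fixed order. The step names, workflow names, and dictionary layout are all hypothetical; this is not MAT's task.xml format or its internal API.

```python
# Hypothetical illustration only: not MAT's actual data model.

# Steps are declared once, globally, for the task.
TASK_STEPS = {
    "zone": "adds zone annotations",
    "tokenize": "adds token annotations",
    "tag": "adds content annotations",
}

# Each workflow names a subset of the global steps, in a fixed order.
WORKFLOWS = {
    "Example hand workflow": ["zone", "tag"],
    "Example automated workflow": ["zone", "tokenize", "tag"],
}


def steps_for(workflow_name):
    """Return the ordered step names for a workflow, checking that
    every step it mentions is declared globally in the task."""
    ordered = WORKFLOWS[workflow_name]
    unknown = [s for s in ordered if s not in TASK_STEPS]
    if unknown:
        raise ValueError(f"undeclared steps: {unknown}")
    return ordered
```

Calling `steps_for("Example automated workflow")` returns the three step names in the order that workflow fixes, while the same step definitions are shared by every workflow in the task.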
Each of these steps knows which annotation sets it adds and/or
modifies, and which engine (if any) it can use to apply those
sets. In the UI, the step will allow you to add or modify only
those annotations and attributes which are in those sets. Each
step will be one of these four types:
This last type of step embodies the tag-a-little,
learn-a-little capability that MAT was originally developed to support.
Some of the steps you'll find include:
||a step for zoning the document||This automated step adds zone annotations. The document zones are the areas that the subsequent steps should pay attention to. The simplest zone step simply marks the entire document as relevant.|
||a step for tokenizing the document||This automated step adds token annotations. Tokens are basically words, and the trainable tagging engine which comes with MAT uses tokens, rather than characters, as its basis for analysis. If you're going to use the trainable engine, either to build a model or to do automated annotation, you have to have tokens. MAT comes with a default tokenizer for English.|
||a step for doing mixed-initiative annotation||This mixed step allows you to apply previously-created models to your document to add content annotations automatically. If you're in the UI, this step also provides you with the opportunity to do hand annotation de novo, or, if there's already a model for this step, correct the output of automated tagging.|
In tutorial 1, you saw how to
apply these steps in the MAT UI, and in tutorial
4, you saw how to apply them on the command line using the MATEngine tool.
In previous releases of MAT, tasks supported a global "undo" order across all workflows, so that "undo" in one workflow would maintain some task-wide consistency for documents. Because of the increased flexibility of steps and workflows in MAT 3.0, this global "undo" order is no longer imposed, and applying "undo" in workflows can leave documents in inconsistent states. In order to compensate for this issue, workflows, by default, are no longer undoable; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).
As we saw in tutorial 2, tutorial 3, and tutorial 4, we can build models
using hand-annotated or hand-corrected documents, and apply these
models to other, unannotated documents.
The training engine that comes with MAT, jCarafe, only works on what we've
called simple span annotations: spanned
annotations with labels or effective labels and no other
attributes. (The person who configured your task may have set up a
different engine, one which can build models for more complex
annotations; she'll tell you if she did that.) Approximately,
jCarafe analyzes the annotated documents and computes the
likelihoods of the various labels occurring in the various
contexts it encounters, as defined by a set of features (e.g.,
what the word is, whether it's capitalized, whether it's
alphanumeric, what words precede and follow) it extracts from the
documents it builds a model from. (The specific technique it uses
is conditional random fields.) You can then present a new
document to jCarafe, and based on the features it finds in that new
document, it will insert annotations in the locations the model
predicts should be there.
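To give a feel for what "features" means here, the following Python sketch computes, for each token, the kinds of cues mentioned above: the word itself, whether it is capitalized, whether it is alphanumeric, and the neighboring words. It is only an illustration under assumed names (`token_features`, the `<START>` and `<END>` markers); it is not jCarafe's actual feature set or code, and it does no model training.

```python
# Hypothetical sketch: these are NOT jCarafe's real features or code.

def token_features(tokens, i):
    """Describe the token at position i by the kinds of cues the text
    mentions: the word itself, capitalization, whether it is
    alphanumeric, and the preceding and following words."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_alphanumeric": word.isalnum(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }


tokens = ["Dr.", "Smith", "visited", "Boston", "."]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence model such as a conditional random field would learn, from many annotated examples, how strongly each such feature is associated with each label in context.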
In general, the more documents you use to train an engine like
jCarafe, the more exemplars of each annotation label it
finds in the training documents, and the greater the variety of
contexts in which those labels occur, the
better a job the engine will do of predicting where the
annotations should be in new, unannotated documents.
These engines are not likely to do a perfect job. There are ways
to improve the engine's performance other than providing more
data; these engines, including jCarafe, can be tuned in a wide
variety of ways. MAT doesn't help you do that. MAT is a tool for
corpus development and human annotator support; its goal is not to
help you produce the best automated tagging system. If you're
brave, you can tune jCarafe in
all sorts of ways, and MAT tries not to hinder your ability to do
that, if you know what you're doing; but it's not the point of the toolkit.
The other thing you need to know is that while jCarafe only works
on simple span annotations, complex annotations won't cause it to
break; it'll just ignore everything it can't handle. So if your
task has spanless annotations, and spanned annotations with lots
of attributes, jCarafe will happily build a model for the spanned
labels alone. You can use your complex annotated data to train
that simple model, use the model to automatically insert those
simple span annotations, and then insert the
remainder of the annotations and attributes by hand.
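One way to picture this behavior is as a reduction of complex annotations to the simple span view a trainer can use. The Python sketch below is based on our reading of the paragraph above, not on jCarafe's implementation; the dictionary-based annotation records and the function name `simple_span_view` are hypothetical, and MAT's real document format differs.

```python
# Hypothetical annotation records; MAT's real document format differs.

def simple_span_view(annotations):
    """Reduce annotations to (start, end, label) triples: spanless
    annotations are skipped and any extra attributes are dropped."""
    view = []
    for ann in annotations:
        if ann.get("start") is None or ann.get("end") is None:
            continue  # spanless: nothing a span-based trainer can use
        view.append((ann["start"], ann["end"], ann["label"]))
    return view


annotations = [
    {"label": "PERSON", "start": 0, "end": 9, "attributes": {}},
    {"label": "EVENT", "start": 10, "end": 17, "attributes": {"tense": "past"}},
    {"label": "RELATION", "start": None, "end": None, "attributes": {}},
]
spans = simple_span_view(annotations)
```

Here the spanless `RELATION` record is ignored and the `tense` attribute is dropped, so only the two labeled spans remain; the ignored pieces are what you would still add by hand.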