To customize MAT, you'll need to create a task
(encapsulated in a directory called a plugin; you can
install your task using the MATManagePluginDirs
utility, as you saw in tutorial 1). The
task you use will contain a number of different sorts of
information, among them workspaces, engines, steps, and workflows.
We'll talk about workspaces in a
bit, and some of the others you don't need to know about yet.
Right now, we're going to talk about engines, steps, and
workflows.
Some of what you want to do in MAT requires an engine.
These engines are how MAT adds annotations for you, rather than
you adding them yourself by hand. Sometimes these engines are
trainable, and you can create models for them, as we did in tutorial 2; some of them are not,
like the engines that added zones and tokens for us in tutorial 1. Each task declares the
engines it needs, and tells MAT where to find the implementation
of that engine (and the model builder for that engine, if it's
trainable).
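To make this concrete, here's a rough sketch of what such a declaration could look like. Every element and attribute name in this fragment (engines, engine, class, model_builder) is a hypothetical illustration, not MAT's actual task.xml syntax; take the real syntax from the MAT documentation.

```xml
<!-- Hypothetical sketch only: these element and attribute names are
     illustrative inventions, not MAT's actual task.xml syntax. -->
<engines>
  <!-- A trainable engine points at both a tagger implementation
       and a model builder. -->
  <engine name="tagger" class="my.plugin.TrainableTagger"
          model_builder="my.plugin.TaggerModelBuilder"/>
  <!-- A non-trainable engine, like a tokenizer, needs only its
       implementation. -->
  <engine name="tokenizer" class="my.plugin.EnglishTokenizer"/>
</engines>
```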
You can find a list of the engines that MAT comes with in the MAT documentation.
In MAT, each workflow consists of a series of steps.
The steps are defined globally in the task; each workflow
arranges a subset of them in a fixed order, depending on your
activity. In MAT 3.0, you might encounter the following workflows:
There will be others, but these are the important ones.
Each of these steps knows which annotation sets it adds and/or
modifies, and which engine (if any) it can use to apply those
sets. In the UI, the step will allow you to add or modify only
those annotations and attributes which are in those sets. Each
step will be one of these four types:
This last type of step embodies the tag-a-little,
learn-a-little capability that MAT was originally developed
to provide.
Some of the steps you'll find include:
| step name | purpose | details |
| --- | --- | --- |
| "zone" | a step for zoning the document | This automated step adds zone annotations. The document zones are the areas that the subsequent steps should pay attention to. The simplest zone step simply marks the entire document as relevant. |
| "tokenize" | a step for tokenizing the document | This automated step adds token annotations. Tokens are basically words, and the trainable tagging engine which comes with MAT uses tokens, rather than characters, as its basis for analysis. If you're going to use the trainable engine, either to build a model or to do automated annotation, you have to have tokens. MAT comes with a default tokenizer for English. |
| "tag" | a step for doing mixed-initiative annotation | This mixed step allows you to apply previously-created models to your document to add content annotations automatically. If you're in the UI, this step also provides you with the opportunity to do hand annotation de novo, or, if there's already a model for this step, correct the output of automated tagging. |
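As a sketch of how these steps combine, here's what a workflow declaration covering the three steps above might look like. The task.xml file does contain workflow elements, but the step elements and attribute names in this fragment are assumptions for illustration; check the MAT documentation for the real syntax.

```xml
<!-- Sketch only: the <step> elements and attribute names here are
     assumptions, not necessarily MAT's actual task.xml syntax. -->
<workflow name="Demo">
  <step name="zone"/>      <!-- automated: mark the regions to annotate -->
  <step name="tokenize"/>  <!-- automated: add token annotations -->
  <step name="tag"/>       <!-- mixed: automated tagging plus hand correction -->
</workflow>
```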
In tutorial 1, you saw how to
apply these steps in the MAT UI, and in tutorial
4, you saw how to apply them on the command line using the MATEngine tool.
In previous releases of MAT, tasks supported a global "undo" order across all workflows, so that "undo" in one workflow would maintain some task-wide consistency for documents. Because of the increased flexibility of steps and workflows in MAT 3.0, this global "undo" order is no longer imposed, and applying "undo" in workflows can leave documents in inconsistent states. In order to compensate for this issue, workflows, by default, are no longer undoable; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).
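For instance, to opt a single workflow back in to "undo", you'd mark it with the undoable attribute. The attribute itself is as described above; the attribute value and the rest of this fragment are assumptions for illustration.

```xml
<!-- The "undoable" attribute re-enables "undo"/"retreat" for this
     workflow alone; use it sparingly. The value syntax ("yes") is an
     assumption; check the MAT documentation. -->
<workflow name="Demo" undoable="yes">
  ...
</workflow>
```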
In MAT, engines can be trainable, i.e., we can build models using hand-annotated or hand-corrected documents and apply those models to other, unannotated documents.
In tutorial 2, tutorial 3, and tutorial 4, we saw how to use the
span annotation training engine that comes with MAT, jCarafe, to do this. jCarafe only
works on what we've called simple span annotations:
spanned annotations with labels or effective labels and no other
attributes. Roughly speaking, jCarafe analyzes the annotated
documents and computes the likelihoods of the various labels
occurring in the various contexts it encounters, as defined by a
set of features (e.g., what the word is, whether it's capitalized,
whether it's alphanumeric, and what words precede and follow) that
it extracts from the documents it builds a model from. (The specific
technique it uses is conditional random fields.) You can
then present a new document to jCarafe, and based on the features
it finds in that new document, it will insert annotations in the
locations the model predicts should be there.
The jCarafe engine can also be used to learn the value of
annotation attributes whose values are drawn from a fixed set,
e.g., the sentiment polarity of a sentence (positive, negative,
neutral). To do this, jCarafe assembles features associated with
the item the attribute belongs to (e.g., which words are
in the sentence) and builds a classifier that predicts the value
of the attribute. (The specific
technique it uses is maximum entropy.) You can then
present jCarafe with a new document containing unmarked items
(e.g., sentences), and based on the features of each item, it will
set the attribute to the value the model
predicts. We provide no tutorials for this use of jCarafe,
but we do have a representative sample task which illustrates it.
In general, the more documents you use to train an engine like
jCarafe, the more exemplars of each annotation label
it finds in the training documents, and the greater the variety of
contexts in which those labels occur in the training documents, the
better a job the engine will do of predicting where the
annotations should be in new, unannotated documents.
These engines are not likely to do a perfect job. There are ways
to improve an engine's performance other than providing more
data; these engines, including jCarafe, can be tuned in a wide
variety of ways, but MAT doesn't help you do that. MAT is a tool for
corpus development and human annotator support; its goal is not to
help you produce the best automated tagging system. If you're
brave, you can tune jCarafe in
all sorts of ways, and MAT tries not to hinder you if you know
what you're doing; but that's not the point of the
toolkit.
The other thing you need to know is that while jCarafe only works
on simple span annotations, complex annotations won't cause it to
break; it simply ignores everything it can't handle. So if your
task has spanless annotations and spanned annotations with lots
of complex attributes, you can still use jCarafe: build a model
for the subset of the annotation task it can handle, train that
simple model on your complex annotated data, use that model to
insert those elements automatically, and add the remainder of the
annotations and attributes by hand.