Tasks, training and automated tagging

The task you use will contain a number of different sorts of information:

We'll talk about workspaces in a bit, and some of these others you don't need to know about yet. Right now, we're going to talk about engines, steps, and workflows.


Engines

Some of what you want to do in MAT requires an engine. Engines are how MAT adds annotations for you, so that you don't have to add them by hand. Some engines are trainable, and you can create models for them, as we did in tutorial 2; others are not, like the engines that added zones and tokens for us in tutorial 1. Each task declares the engines it needs, and tells MAT where to find the implementation of each engine (and the model builder for that engine, if it's trainable).

The engines that MAT comes with are listed elsewhere in the MAT documentation.

Steps and workflows

In MAT, each workflow consists of a series of steps. These steps are global in the task; the workflows put subsets of them in fixed orders, depending on your activity. In MAT 3.0, you might encounter the following workflows:

There will be others, but these are the important ones.

Each of these steps knows which annotation sets it adds and/or modifies, and which engine (if any) it can use to apply those sets. In the UI, the step will allow you to add or modify only those annotations and attributes which are in those sets. Each step will be one of these four types:

This last type of step embodies the tag-a-little, learn-a-little capability that MAT was originally developed to provide.
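The relationship between steps and workflows described above can be pictured with a small Python sketch. This is an illustration only, not MAT's internal representation: the class, the step names, the engine names, and the workflow names are all hypothetical. The sketch shows steps defined once for the task, workflows as fixed orderings of subsets of those steps, and each step restricting edits to the annotation sets it owns.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for a task step (not a MAT class)."""
    name: str
    annotation_sets: Tuple[str, ...]   # sets this step adds and/or modifies
    engine: Optional[str] = None       # engine the step can use, if any


# Steps are global in the task. All names below are made up for illustration.
TASK_STEPS = {
    "zone": Step("zone", ("zones",), engine="zone_engine"),
    "tokenize": Step("tokenize", ("tokens",), engine="tokenizer"),
    "tag": Step("tag", ("content",), engine="tagger"),
}

# Each workflow puts a subset of the task's steps in a fixed order.
WORKFLOWS = {
    "full": ["zone", "tokenize", "tag"],
    "prep_only": ["zone", "tokenize"],
}


def steps_for(workflow_name):
    """Return the Step objects of a workflow, in workflow order."""
    return [TASK_STEPS[name] for name in WORKFLOWS[workflow_name]]


def may_edit(step, annotation_set):
    """A step only allows changes to the annotation sets it declares."""
    return annotation_set in step.annotation_sets
```

With these definitions, `steps_for("prep_only")` yields the zone and tokenize steps in that order, and `may_edit(TASK_STEPS["zone"], "content")` is false because the zone step does not own the content annotations.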

Some of the steps you'll find include:

A step for zoning the document: This automated step adds zone annotations. The document zones are the areas that the subsequent steps should pay attention to. The simplest zone step marks the entire document as relevant.

A step for tokenizing the document: This automated step adds token annotations. Tokens are basically words, and the trainable tagging engine which comes with MAT uses tokens, rather than characters, as its basis for analysis. If you're going to use the trainable engine, either to build a model or to do automated annotation, you have to have tokens. MAT comes with a default tokenizer for English.

A step for doing mixed-initiative annotation: This mixed step allows you to apply previously-created models to your document to add content annotations automatically. If you're in the UI, this step also lets you do hand annotation de novo or, if there's already a model for this step, correct the output of automated tagging.

In tutorial 1, you saw how to apply these steps in the MAT UI, and in tutorial 4, you saw how to apply them on the command line using the MATEngine tool.

A note about undo

In previous releases of MAT, tasks supported a global "undo" order across all workflows, so that "undo" in one workflow would maintain some task-wide consistency for documents. Because of the increased flexibility of steps and workflows in MAT 3.0, this global "undo" order is no longer imposed, and applying "undo" in workflows can leave documents in inconsistent states. To compensate, workflows are no longer undoable by default; the "retreat" buttons in the UI will not be present, for instance. You can mark a workflow as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly: enable it only for workflows which support all your tagging steps (content and otherwise).
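If you want to check which workflows in a task file have been marked this way, a few lines of Python will do it. The sketch below is hedged heavily: only the <workflow> element and its "undoable" attribute come from the description above. The surrounding elements, the workflow names, and the attribute value "yes" are assumptions made for the sake of a runnable example, so consult the task.xml reference for the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical task.xml fragment. Only <workflow> and its "undoable"
# attribute are taken from the text; the wrapper elements, the names,
# and the value "yes" are assumptions.
TASK_XML = """
<task name="Example">
  <workflows>
    <workflow name="Demo" undoable="yes"/>
    <workflow name="Batch"/>
  </workflows>
</task>
"""


def undoable_workflows(xml_text):
    """Return the names of workflows whose undoable attribute is 'yes'."""
    root = ET.fromstring(xml_text)
    return [
        wf.get("name")
        for wf in root.iter("workflow")
        if wf.get("undoable") == "yes"   # assumed attribute value
    ]
```

Here the "Batch" workflow has no "undoable" attribute and is therefore treated as not undoable, which mirrors the default described above.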

Training and automated tagging

As we saw in tutorial 2, tutorial 3, and tutorial 4, we can build models using hand-annotated or hand-corrected documents, and apply these models to other, unannotated documents.

The training engine that comes with MAT, jCarafe, only works on what we've called simple span annotations: spanned annotations with labels or effective labels and no other attributes. (The person who configured your task may have set up a different engine, one which can build models for more complex annotations; if so, that person will tell you.) Roughly speaking, jCarafe analyzes the annotated documents and computes the likelihoods of the various labels occurring in the various contexts it encounters. These contexts are defined by a set of features (e.g., what the word is, whether it's capitalized, whether it's alphanumeric, what words precede and follow) that it extracts from the training documents. (The specific technique it uses is conditional random fields.) You can then present a new document to jCarafe, and based on the features it finds in that new document, it will insert annotations in the locations where the model predicts they should be.
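To make the idea of features concrete, here is a minimal Python sketch that computes the kinds of per-token features mentioned above (the word itself, capitalization, whether it's alphanumeric, and the neighboring words). It is an illustration only: the function name, the feature names, and the boundary markers are hypothetical, and jCarafe's actual feature set and implementation are different and far richer.

```python
def token_features(tokens, i):
    """Describe token i by a few simple contextual features.

    Illustrative only; this is not jCarafe's feature extractor.
    """
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_alphanumeric": word.isalnum(),
        # Neighboring words, with made-up markers at the document edges.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = ["Ms.", "Smith", "visited", "Boston"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence model such as a conditional random field is trained on feature descriptions like these, paired with the labels from the hand-annotated documents, and then predicts labels for the tokens of new documents.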

In general, the more documents you use to train an engine like jCarafe, the more exemplars of each annotation label it finds in the training documents, and the greater the variety of contexts in which those labels occur, the better a job the engine will do of predicting where the annotations should be in new, unannotated documents.

These engines are not likely to do a perfect job. There are ways to improve the engine's performance other than providing more data; these engines, including jCarafe, can be tuned in a wide variety of ways. MAT doesn't help you do that. MAT is a tool for corpus development and human annotator support; its goal is not to help you produce the best automated tagging system. If you're brave, you can tune jCarafe in all sorts of ways, and MAT tries not to hinder your ability to do that, if you know what you're doing; but it's not the point of the toolkit.

The other thing you need to know is that while jCarafe only works on simple span annotations, complex annotations won't cause it to break; it'll just ignore everything it can't handle. So if your task has spanless annotations, and spanned annotations with lots of attributes, jCarafe will happily build a model for the spanned labels alone. You can use your complex annotated data to train that simple model, use the model to insert those simple span annotations automatically, and then insert the remainder of the annotations and attributes by hand.
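One way to picture this is as a projection of your annotations down to what a simple-span trainer can use. The Python sketch below is one reading of the paragraph above, not a description of how jCarafe is implemented: the Annotation class and the function name are hypothetical. Spanless annotations are dropped, and spanned annotations are reduced to their label and span, with any other attributes left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical annotation record (not a MAT class)."""
    label: str
    start: Optional[int] = None    # None for spanless annotations
    end: Optional[int] = None
    attributes: dict = field(default_factory=dict)


def simple_span_view(annotations):
    """Keep only (label, start, end) for annotations that have a span."""
    return [
        (a.label, a.start, a.end)
        for a in annotations
        if a.start is not None and a.end is not None
    ]


annotations = [
    Annotation("PERSON", 0, 9, {"gender": "unknown", "role": "author"}),
    Annotation("LOCATION", 20, 26),
    Annotation("RELATION", attributes={"type": "works_in"}),  # spanless
]
training_view = simple_span_view(annotations)
```

In this example the training view contains the PERSON and LOCATION spans only. The attributes on PERSON and the spanless RELATION annotation are the parts you would still add by hand after automated tagging.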