MAT
The Annotation Toolkit

The MITRE Annotation Toolkit (MAT) is a suite of tools for automated and human annotation of documents.

If you don't know what annotation is, here's a quick-and-dirty definition: annotation is a process, used mostly by researchers in natural language processing, of enhancing documents with information about the various phrase types the documents contain. So if the document contains the sentence "John F. Kennedy visited Guam", the document would be enhanced with the information that "John F. Kennedy" is a person and "Guam" is a location. MAT provides support for defining the types of information that should be added, and for adding that information manually or automatically.
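As a rough illustration of the definition above, an annotated document can be pictured as the original text plus a list of typed character spans. The sketch below is not MAT's document format or API; the `Annotation` class, its `label`, `start`, and `end` fields, and the `annotate` helper are all names assumed for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical standoff annotation: a label over a character span."""
    label: str  # phrase type, e.g. "PERSON" or "LOCATION"
    start: int  # index of the first character of the phrase
    end: int    # index one past the last character of the phrase


def annotate(text, phrase, label):
    """Build an Annotation for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Annotation(label, start, start + len(phrase))


text = "John F. Kennedy visited Guam"
annotations = [
    annotate(text, "John F. Kennedy", "PERSON"),
    annotate(text, "Guam", "LOCATION"),
]

# The text itself is never modified; the annotations only point into it.
for a in annotations:
    print(a.label, repr(text[a.start:a.end]))
```

Keeping the annotations separate from the text means the same document can carry manual and automatic annotations without either one altering the source.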

MAT supports both UI interaction and command-line interaction, and provides various levels of control over the overall annotation process. It can be customized for specific tasks (e.g., named entity identification, de-identification of medical records). The goal of MAT is not to help you configure your training engine (in the default case, the Carafe CRF system) to achieve the best possible performance on your data. MAT is for "everything else": all the tools you end up wishing you had.

MAT contains:

Because the MAT components are loosely coupled, you can do a whole range of things with MAT, like

Some of these things require a little work on your part, but MAT's value added is considerable.

MAT's design targets the tag-a-little, learn-a-little (TALLAL) loop. The TALLAL loop is used to build a corpus of correctly annotated documents side by side with a model that adds these annotations to documents automatically. In the default case, the user begins by hand-annotating a group of documents and using the trainer to create a model for automatic tagging. The user then uses the model to create automatically annotated documents, which can be hand-corrected and added to the corpus of documents available as inputs to model creation. In this way, the user expands the corpus while creating a better-fitting model, and the correction effort shrinks on each iteration as the model improves. At various points, the user can also use the correctly annotated corpus to assess the accuracy of the model, using the experiment engine.
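The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not MAT's implementation: `train` stands in for the trainer by simply memorizing annotated phrases (the default engine, Carafe, is a CRF system), `tag` stands in for the automatic tagger, and `hand_correct` simulates the human annotator by looking up a hypothetical `GOLD` answer key. None of these names come from MAT, and the experiment-engine evaluation step is left out.

```python
def train(corpus):
    """Toy 'trainer' (assumption): memorize each annotated phrase and its label."""
    model = {}
    for text, spans in corpus:
        for start, end, label in spans:
            model[text[start:end]] = label
    return model


def tag(model, text):
    """Toy 'tagger' (assumption): mark the first occurrence of each known phrase."""
    spans = []
    for phrase, label in model.items():
        start = text.find(phrase)
        if start != -1:
            spans.append((start, start + len(phrase), label))
    return sorted(spans)


def tallal(texts, hand_correct, batch_size=1):
    """Tag a little, learn a little: tag, correct, grow the corpus, retrain."""
    model, corpus, corrections = {}, [], []
    for i in range(0, len(texts), batch_size):
        for text in texts[i:i + batch_size]:
            proposed = tag(model, text)               # automatic annotation
            corrected = hand_correct(text, proposed)  # human review
            # Count spans the human had to add or remove for this document.
            corrections.append(len(set(corrected) ^ set(proposed)))
            corpus.append((text, corrected))
        model = train(corpus)                         # retrain on the larger corpus
    return model, corpus, corrections


# Hypothetical answer key standing in for the human annotator.
GOLD = {
    "John F. Kennedy visited Guam": [(0, 15, "PERSON"), (24, 28, "LOCATION")],
    "Guam welcomed John F. Kennedy": [(0, 4, "LOCATION"), (14, 29, "PERSON")],
}


def hand_correct(text, proposed):
    """Simulated hand-correction: replace the proposal with the gold spans."""
    return sorted(GOLD[text])


model, corpus, corrections = tallal(list(GOLD), hand_correct)
print(corrections)  # [2, 0]
```

In the first round the model is empty, so the simulated annotator supplies both spans by hand; in the second round the retrained model already proposes the right spans and no corrections are needed, which is the effort reduction the loop is meant to deliver.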

MAT's original target application was MIST, a toolkit for de-identification of free-text medical documents. Our goal was to put this TALLAL capability in the hands of medical professionals who were neither linguists nor computer scientists.

You can download MAT here. The current version of MAT is 3.3.

While MITRE has no control over what you do with MAT, we do ask, as a courtesy, that you let us know what novel uses you're applying it to.

MITRE has assigned a BSD license to its contributions to MAT.

MAT is distributed with a large number of open-source components which bear similarly liberal licenses. MAT requires specific versions of some of these tools, and in some cases has modified those packages to enhance their functionality. The packages and their licenses are:

MAT is a research prototype. It is not intended to be enterprise-ready: it is not internationalized, it is not configured to work with enterprise Web capabilities like Tomcat or Apache, it has no real security model, and it is not designed for 24/7 availability or replication.

On the other hand, MAT has been under development for several years, and has been used successfully by a number of MITRE internal projects and sponsors. We hope you find it useful.

MAT is not formally supported. MITRE has set up a users mailing list for discussion about MAT, but the MAT team does not have the resources to provide open-ended, long-term support for open-source packages. Members of the MAT team may monitor this list and may occasionally comment, but they cannot be relied upon for help or advice.

If you're interested in the technical details of MAT and how to use it, you can read the on-line documentation (best viewed with Firefox).