Python API

MAT document objects can be accessed and manipulated in three programming languages: Java, JavaScript and Python. The Python API is the most extensive of the three, since almost all of the MAT tools and server-side services are implemented in Python. In this document, we provide a usable subset of the Python API which will allow you to create, manipulate, read and write MAT documents.

Loading the API

If MAT_PKG_HOME is a Python variable set to the appropriate value, you can load the API as follows:

import sys, os

# Insert MAT_PKG_HOME at the head of the path list
sys.path.insert(0, os.path.join(MAT_PKG_HOME, "lib", "mat", "python"))

import MAT

All the various operations you'll need are within the MAT package or its subpackages.
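
If MAT_PKG_HOME isn't already a Python variable, one option is to read it from the environment. The sketch below assumes an environment variable with the same name (an assumption for illustration, not something this document says MAT requires). The helpers mat_python_dir() and add_mat_to_path() are hypothetical names; they build the same path used above.

```python
import os
import sys


def mat_python_dir(mat_pkg_home):
    """Return the directory that holds the MAT Python package."""
    return os.path.join(mat_pkg_home, "lib", "mat", "python")


def add_mat_to_path(mat_pkg_home, path_list=None):
    """Insert the MAT Python directory at the head of a path list.

    path_list defaults to sys.path. If the directory is already in the
    list, the list is left unchanged.
    """
    if path_list is None:
        path_list = sys.path
    mat_dir = mat_python_dir(mat_pkg_home)
    if mat_dir not in path_list:
        path_list.insert(0, mat_dir)
    return mat_dir


# Assumption: the installation root is exported as the environment
# variable MAT_PKG_HOME. Nothing happens if it is not set.
MAT_PKG_HOME = os.environ.get("MAT_PKG_HOME")
if MAT_PKG_HOME is not None:
    add_mat_to_path(MAT_PKG_HOME)
```

After this runs with the variable set, "import MAT" should work as shown above.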

Accessing your task

Your task contains all sorts of information about the well-formedness of the annotations in documents you might create or read. You don't need to use this well-formedness information if you don't want to. So for instance, your task may specify that annotations with label "PERSON" have an attribute named "type" whose value is a string; if you use your well-formedness information, and you try to create or read a document with a PERSON annotation which contains a "type" attribute whose value is an integer, an error will be thrown. If you don't use your well-formedness information, the type of the first occurrence of each annotation attribute will determine its type for that annotation label in that document.

In other words, the only difference in the behavior of the API is when, and whether, it throws any well-formedness errors.

Assuming your tasks are registered in the normal way, here's how to access a task named "My Task". The task name you use is the same name you'd use for any of the MAT utilities.

pDir = MAT.PluginMgr.LoadPlugins()

# pDir implements a dictionary, so you can use it as a dictionary

try:
    myTask = pDir["My Task"]
except KeyError:
    pass

# Alternatively, the getTask method catches the error and
# returns None if the task isn't found

myTask = pDir.getTask("My Task")
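
If a missing task should stop your script rather than pass silently, you can wrap getTask() in a small helper. This is a minimal sketch: require_task() is a hypothetical name, and the only MAT call it relies on is the getTask() method shown above.

```python
def require_task(plugin_dir, task_name):
    """Return the named task, or raise ValueError if it isn't registered.

    plugin_dir is expected to be the object returned by
    MAT.PluginMgr.LoadPlugins(); only its getTask() method is used.
    """
    task = plugin_dir.getTask(task_name)
    if task is None:
        raise ValueError("no task named %r is registered" % (task_name,))
    return task
```

With the API loaded, you would call it as require_task(pDir, "My Task").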

Reading a document

To read a document from a file, you must instantiate a reader, using the same reader name you'd use with any of the MAT utilities (e.g., "mat-json", "xml-inline"). If you want the document to use your task's well-formedness information, you can pass the task to the reader when you instantiate it.

# Here's how you instantiate a reader without a task.

r = MAT.DocumentIO.getDocumentIO("mat-json")

# And here's how you do it with the task retrieved in the previous section.

r = MAT.DocumentIO.getDocumentIO("mat-json", task = myTask)

# Once you have this reader, you can use it to read a document.
# This method raises MAT.Document.LoadError if the document can't be loaded.

d = r.readFromSource("/path/to/doc")

# You can also read from standard input.

d = r.readFromSource("-")

# Readers have no document-level state; you can use the same reader
# to read an arbitrary number of documents.

d1 = r.readFromSource("/path/to/doc1")
d2 = r.readFromSource("/path/to/doc2")
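
Because a reader carries no document-level state, reading a batch of files with one reader is straightforward. The helper below is a sketch only: read_all() is a hypothetical name, and the exception class to catch is passed in so that the function doesn't need to import MAT itself. With the API loaded you would pass MAT.Document.LoadError.

```python
def read_all(reader, paths, load_error=Exception):
    """Read every path with a single reader.

    Returns (docs, failures): docs maps each readable path to its
    document, and failures maps each unreadable path to the exception
    that was raised. load_error is the exception class to catch; the
    default, Exception, is deliberately broad and is only a fallback.
    """
    docs = {}
    failures = {}
    for path in paths:
        try:
            docs[path] = reader.readFromSource(path)
        except load_error as e:
            failures[path] = e
    return docs, failures
```

A call might look like read_all(r, ["/path/to/doc1", "/path/to/doc2"], load_error = MAT.Document.LoadError).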

The getDocumentIO() method also accepts the "encoding" keyword; this is ignored in the case of a reader like mat-json for which the encoding is always the same, but is quite useful when reading documents using the "raw" reader to create an unannotated MAT document:

r = MAT.DocumentIO.getDocumentIO("raw", encoding = "big5")

If you want to customize the instantiation of a reader, you can pass the XML reader attributes as keyword arguments to the instantiator. For instance, if you want to instantiate an xml-inline reader so that all XML elements in the document are converted to annotations, rather than just the ones defined in your task, you can do this:

r = MAT.DocumentIO.getDocumentIO("xml-inline", task = myTask, xml_translate_all = True)

Note that boolean options accept Python True/False as their values.

Creating a document

Alternatively, you can create a document from scratch, rather than loading it from a serialized source. You should provide a document signal (i.e., the text of the document) as a Python unicode object.

# You can create a document directly, without the task well-formedness information...

d = MAT.Document.AnnotatedDoc(signal = u"This is the document signal.")

# ...or via the task, which will provide the task well-formedness information.

d = myTask.newDocument(signal = u"This is the document signal.")

Adding and copying annotations

You can add annotations to your document, either spanned or spanless (there are separate methods for each).

The createAnnotation() method creates a spanned annotation. The type of the created object is MAT.Document.Annotation. The first two arguments are the start and end offsets of the span. MAT offsets are zero-based character offsets into the signal, and the end offset is the position just past the last character of the span.

So in the signal "This is the document signal", the first word "This" has a start offset of 0 and an end offset of 4.
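
The offsets in this example line up with Python string indices, so you can compute them with ordinary string operations. The sketch below assumes that correspondence holds in general (subject to the Unicode caveat discussed later in this section); span_of() is a hypothetical helper, not part of MAT.

```python
def span_of(signal, text, start_at=0):
    """Return (start, end) offsets of the first occurrence of text.

    start is the index of the first character and end is one past the
    last character, matching the "This" example. Returns None if text
    does not occur in signal at or after start_at.
    """
    start = signal.find(text, start_at)
    if start < 0:
        return None
    return start, start + len(text)


signal = u"This is the document signal"
start, end = span_of(signal, u"This")

# Slicing the signal with the two offsets recovers the covered text.
covered = signal[start:end]
```

The resulting pair can be passed directly as the first two arguments of createAnnotation().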

# Create a span annotation without any attributes.

spanA = d.createAnnotation(5, 10, "PERSON")

# The start and end offsets are available as instance variables.

startOffset = spanA.start
endOffset = spanA.end

# The label is converted to a MAT.Document._Atype instance,
# which is the same for all annotations with that label in the document.
# The original label is the value of the "lab" attribute of the atype.

originalLabel = spanA.atype.lab

Spanless annotations are created using the createSpanlessAnnotation() method. The type of the created object is MAT.Document.SpanlessAnnotation.

spanlessA = d.createSpanlessAnnotation("EVENT")
originalLabel = spanlessA.atype.lab

The _Atypes are the primary way that the task well-formedness conditions are implemented. It's not necessary to understand the details of these objects, and we won't discuss them further here.

A quirk that's important to be aware of for spanned annotations (although you're extremely unlikely to encounter it) is that, like many programming languages, Python usually doesn't implement true Unicode characters; instead, it essentially implements the UTF-16 encoding. This means, in normal circumstances, that if your document signal contains Unicode characters from above Unicode plane 0 (e.g., musical notation), your offsets will be wrong. We say "usually" because it's possible to build Python with 4-byte characters; the value of the Python variable sys.maxunicode essentially tells you whether this has happened. Such a Python installation will behave properly for all Unicode characters.
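
You can check for this situation before trusting your offsets. The sketch below uses only the Python standard library: sys.maxunicode is 0xFFFF on a UTF-16 ("narrow") build and 0x10FFFF on a 4-byte ("wide") build. The helper names are hypothetical.

```python
import sys

# True when this Python stores one code point per character (a "wide"
# build); False when it stores UTF-16 code units (a "narrow" build).
WIDE_BUILD = sys.maxunicode > 0xFFFF


def utf16_length(text):
    """Number of UTF-16 code units needed to represent text."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text):
    """Number of Unicode code points in text."""
    return len(text.encode("utf-32-le")) // 4


def has_chars_above_plane_0(text):
    """True if text contains any character above Unicode plane 0.

    Each such character takes two UTF-16 code units, so the two
    lengths differ exactly when at least one is present.
    """
    return utf16_length(text) != code_point_length(text)
```

If WIDE_BUILD is False and has_chars_above_plane_0() returns True for your signal, offsets that fall after the first such character are the ones affected.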

All annotations have a copy() method, but this method differs from the document methods that create the annotations in that the annotations aren't added to the document when they're copied. So you have to do this step separately:

# Assuming your document is d:

newAnnot = spanA.copy()
d._addAnnotation(newAnnot)

Creating and accessing annotation attributes

You can provide attribute-value pairs as a dictionary when you create an annotation, or add them later. Attributes are handled identically in spanned and spanless annotations.

# Annotations implement the Python dictionary interface, so you can
# set or access attribute values using it.

spanA["type"] = "norm"
tVal = spanA["type"]
# Or, avoid the possible KeyError if the value is not set.
tVal = spanA.get("type")

# Create a span annotation with attributes.

spanA = d.createAnnotation(5, 10, "PERSON", attrs = {"type": "norm"})

Attribute-value pairs are deemed to be present in an annotation only if the attribute value is not None. You can erase an attribute-value pair by setting its value to None:

spanA["type"] = None

Note that due to a long-standing quirk in the Python implementation, attributes which have been set to None return None when accessed, but attributes which have never been set raise KeyError. The system treats these two cases as equivalent. The simplest way to deal with this quirk is to always use the get() method when you're not sure the value has been set.
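
A small predicate makes the "present only if not None" rule explicit. This is a minimal sketch; attr_is_set() is a hypothetical name, and it relies only on the get() method described above, so a plain Python dictionary behaves the same way for testing purposes.

```python
def attr_is_set(annot, name):
    """True if the attribute is present, i.e. its value is not None.

    Because this uses get(), an attribute that was never set and one
    that was erased by assigning None are treated alike, and no
    KeyError is raised in either case.
    """
    return annot.get(name) is not None
```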

Because the annotations implement the dictionary interface, you can retrieve all attribute value pairs and loop over them as follows:

for attr, val in spanA.items():
    pass  # ... do something with attr and val
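
The loop above can be wrapped in a helper that collects the pairs into a plain dictionary. We haven't verified whether items() reports attributes that were erased by setting them to None, so this sketch filters them out defensively; present_attrs() is a hypothetical name.

```python
def present_attrs(annot):
    """Return a plain dict of the attribute-value pairs that are present.

    Pairs whose value is None are skipped, since such attributes are
    not deemed to be present in the annotation.
    """
    return dict((attr, val) for attr, val in annot.items()
                if val is not None)
```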

Attribute types

The various attribute values in the MAT document model are implemented in Python as follows:

MAT type      Python type
string        str, unicode
integer       int
float         float
boolean       bool (i.e., True, False)
annotation    MAT.Document.Annotation, MAT.Document.SpanlessAnnotation

Lists and sets of values do not use the normal Python list and set objects, and trying to use them will raise an error. Instead, you must use either MAT.Annotation.AttributeValueList or MAT.Annotation.AttributeValueSet. These special versions of the Python list and set objects are required to deal with various bookkeeping issues; for instance, if you use a set object as the value of an attribute, all the members of the set must be of the appropriate MAT type, and if you subsequently add an element to the set, the type restriction must hold for this new element as well.

MAT.Annotation.AttributeValueList and MAT.Annotation.AttributeValueSet are specializations of Python list and set, respectively, and support all the methods for those basic types.
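
To make the bookkeeping concrete, here is an illustration of how a set can enforce a type restriction on its members. This is not MAT's implementation: TypedSet is a hypothetical class, it guards only add(), and it raises TypeError rather than a MAT error. A complete version would also need to cover update() and the other methods that change the set.

```python
class TypedSet(set):
    """A set that accepts members of one type only (illustration)."""

    def __init__(self, member_type, members=()):
        set.__init__(self)
        self.member_type = member_type
        # Route the initial members through add() so they are checked.
        for m in members:
            self.add(m)

    def add(self, member):
        """Add member, or raise TypeError if it has the wrong type."""
        if not isinstance(member, self.member_type):
            raise TypeError("expected %s, got %r"
                            % (self.member_type.__name__, member))
        set.add(self, member)
```

The check runs both for the initial members and for anything added later, which is the property the MAT classes need to maintain.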

Here's how you'd add two PERSON mentions as a set of annotations as the value of the "mentions" attribute of a COREF annotation:

m1 = d.createAnnotation(5, 15, "PERSON")
m2 = d.createAnnotation(100, 117, "PERSON")
mSet = MAT.Annotation.AttributeValueSet([m1, m2])
c1 = d.createSpanlessAnnotation("COREF", attrs = {"mentions": mSet})

# Adding another annotation to the set works fine...
mSet.add(d.createAnnotation(200, 206, "PERSON"))

# ...but adding a string to the set would raise MAT.Annotation.AnnotationError.
mSet.add("this value won't be accepted")

Note, crucially, that the type of the value of an annotation-valued attribute is a Python annotation instance. When the document is serialized by the mat-json writer, this annotation is converted to a string ID, and that's what you'll see when you inspect a mat-json document; however, that's only for the purposes of serialization.

Retrieving sets of annotations

MAT documents support an extremely useful utility for retrieving annotations, the getAnnotations() method.

# Without arguments, getAnnotations() retrieves all the annotations in the document,
# including SEGMENTs, tokens, and zones.

dList = d.getAnnotations()

# You can limit the retrieval to specific labels.

dList = d.getAnnotations(atypes = ["PERSON", "LOCATION"])

# You can limit the retrieval to just the spanned annotations, or just the spanless ones.

dList = d.getAnnotations(spannedOnly = True)
dList = d.getAnnotations(spanlessOnly = True)

# You can order the spanned annotations by their start offset:

dList = d.getAnnotations(atypes = ["PERSON", "LOCATION"], ordered = True)
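
As an example of working with the result, the sketch below tallies annotations by label. count_by_label() is a hypothetical helper; it uses only getAnnotations() with the atypes keyword and the atype.lab attribute described earlier.

```python
def count_by_label(doc, labels=None):
    """Return a dict mapping each label to its number of annotations.

    If labels is None, every annotation in the document is counted;
    otherwise retrieval is limited to the given labels.
    """
    if labels is None:
        annots = doc.getAnnotations()
    else:
        annots = doc.getAnnotations(atypes=labels)
    counts = {}
    for a in annots:
        lab = a.atype.lab
        counts[lab] = counts.get(lab, 0) + 1
    return counts
```

For instance, count_by_label(d, ["PERSON", "LOCATION"]) would report how many annotations of each of those two labels the document contains.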

Removing annotations

You can remove annotations individually, or by label.

# Remove a single annotation.

d.removeAnnotation(spanA)

# Remove all "PERSON" annotations.

d.removeAnnotations(atypes = ["PERSON"])

# Remove all annotations, including tokens, zones, and SEGMENTs.

d.removeAnnotations()

A special case to keep in mind is that annotations which are the values of annotation-valued attributes cannot be removed. In other words, in our coref case above, the following will raise an error:

d.removeAnnotation(m2)

The API does not currently provide a way to force the annotation to be detached. There are two things you can do in this case. If you're not removing the annotation which points to the annotation you're removing (in our case, c1), you must detach the annotation by hand:

mSet.discard(m2)
d.removeAnnotation(m2)

On the other hand, if you are also removing this "referencing" annotation, you can remove them together:

d.removeAnnotationGroup([m2, c1])

This operation is intentionally difficult; it's not clear to us that it should be easy to remove annotations which are referenced by other annotations.
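
If you detach by hand often, the two steps can be combined. This is a sketch under the assumption that you already know which attribute value sets refer to the annotation; detach_and_remove() is a hypothetical helper, and it doesn't search the document for references.

```python
def detach_and_remove(doc, annot, referencing_sets):
    """Detach annot from each referencing set, then remove it.

    referencing_sets is an iterable of the set-valued attribute values
    (such as mSet above) that may contain annot. discard() is used so
    that a set which doesn't contain annot is left alone.
    """
    for value_set in referencing_sets:
        value_set.discard(annot)
    doc.removeAnnotation(annot)
```

In the coref case, detach_and_remove(d, m2, [mSet]) performs the same two calls shown above.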

Writing documents

You can use the MAT reader/writer system to write documents much as you can use it to read them. It's less important, in this case, whether you instantiate the writer using the task or not, but it's best practice to do the same thing you did when you read the document.

w = MAT.DocumentIO.getDocumentIO("mat-json", task = myTask)
w.writeToTarget(d, "/path/to/save/document/in")

You can also write to standard output using the "-" target, and pass XML writer parameters as keyword arguments to the instantiator.
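
Putting reading and writing together, converting a document from one format to another takes a single read and a single write. The helper below is a minimal sketch; convert_document() is a hypothetical name, and it uses only readFromSource() and writeToTarget() as described in this document.

```python
def convert_document(reader, writer, source, target):
    """Read a document from source and write it to target.

    reader and writer are objects obtained from
    MAT.DocumentIO.getDocumentIO(). The document is returned so the
    caller can inspect or modify it further.
    """
    doc = reader.readFromSource(source)
    writer.writeToTarget(doc, target)
    return doc
```

For example, with r created for "xml-inline" and w for "mat-json", both instantiated with the same task, convert_document(r, w, "/path/to/doc", "-") would write the converted document to standard output.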