The MAT document object is accessible and manipulable in three
programming languages: Java, JavaScript and Python. The Python API
is the most extensive of the three, since almost all of the MAT
tools and server-side services are implemented in Python. In this
document, we provide a usable subset of the Python API which will
allow you to create, manipulate, read and write MAT documents.
Note: The way you access the MAT library changed in version 3.3.
If MAT_PKG_HOME is a Python variable set to the appropriate value, you
can load the API as follows:
import sys, os
# Insert MAT_PKG_HOME at the head of the path list
sys.path.insert(0, os.path.join(MAT_PKG_HOME, "lib", "mat", "python"))
import MATApplication
import MAT
All the various operations you'll need are within the MAT package
or its subpackages. The MATApplication package bootstraps the
loading of what will ultimately be a standalone Python MAT
library, accessible via pip.
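As a minimal sketch of the path setup above, the helper below wraps the same sys.path.insert() step so that repeated calls don't leave duplicate entries. The function names mat_python_dir and add_mat_to_path are hypothetical and are not part of MAT; only the lib/mat/python layout comes from the example above.

```python
import os
import sys


def mat_python_dir(mat_pkg_home):
    """Return the directory that holds the MAT Python packages."""
    return os.path.join(mat_pkg_home, "lib", "mat", "python")


def add_mat_to_path(mat_pkg_home, path_list=None):
    """Put the MAT Python directory at the head of path_list (default: sys.path)."""
    if path_list is None:
        path_list = sys.path
    mat_dir = mat_python_dir(mat_pkg_home)
    # Drop any earlier entry so the directory appears exactly once, at the head.
    if mat_dir in path_list:
        path_list.remove(mat_dir)
    path_list.insert(0, mat_dir)
    return mat_dir
```

After calling add_mat_to_path(MAT_PKG_HOME), the import MATApplication and import MAT statements should proceed as shown above.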
If you've acquired the Python pip module for MAT, we assume that
you've installed it using "pip install" and it's available in your
environment in the normal way. In this case, all you need to do is
import MAT
However, accessing
your task will be different.
If you're loading
MAT from the full MAT application, your runtime environment will be set for
you by default. If you're loading MAT via the pip module, you'll
have to set up all the runtime variables by hand.
Runtime variables are managed through the MAT.Config.MATConfig
object. So, e.g., if you want to change the value of
SUBPROCESS_DEBUG:
MAT.Config.MATConfig.configureFromSetting("SUBPROCESS_DEBUG", "yes")
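If you need to set several runtime variables by hand, one possible approach is a small loop over a dictionary. The helper name apply_settings is hypothetical; it assumes only that the object you pass in offers the configureFromSetting(name, value) call shown above, and that values are given as strings, as in the "yes" example.

```python
def apply_settings(config, settings):
    """Call config.configureFromSetting(name, value) for each pair, in name order.

    config is expected to be MAT.Config.MATConfig (an assumption of this sketch);
    settings is a plain dictionary mapping variable names to string values.
    """
    applied = []
    for name in sorted(settings):
        config.configureFromSetting(name, settings[name])
        applied.append(name)
    return applied
```

With MAT loaded, a call might look like apply_settings(MAT.Config.MATConfig, {"SUBPROCESS_DEBUG": "yes"}).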
Your task contains all sorts of information about the well-formedness of the
annotations in documents you might create or read. You don't need
to use this well-formedness information if you don't want to. So
for instance, your task may specify that annotations with label
"PERSON" have an attribute named "type" whose value is a string;
if you use your well-formedness information, and you try to create
or read a document with a PERSON annotation which contains a
"type" attribute whose value is an integer, an error will be
thrown. If you don't use your well-formedness information,
the type of the first occurrence of each annotation attribute will
determine its type for that annotation label in that document.
In other words, the only difference in the behavior of the API is
when, and whether, it throws any well-formedness errors.
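The "first occurrence determines the type" rule can be pictured with a toy model. This is an illustration only: the class FirstSeenTypes is hypothetical, it is not how MAT implements the rule, and it raises TypeError rather than any MAT error class.

```python
class FirstSeenTypes(object):
    """Toy model: remember the type of the first value seen per (label, attribute)."""

    def __init__(self):
        self._types = {}

    def check(self, label, attr, value):
        """Record the type on first sight; raise TypeError on a later mismatch."""
        key = (label, attr)
        expected = self._types.setdefault(key, type(value))
        if not isinstance(value, expected):
            raise TypeError("%s.%s expects %s, got %s" % (
                label, attr, expected.__name__, type(value).__name__))
        return expected
```

In this model, a task with well-formedness information would simply pre-populate the table, so the error appears on the first offending value rather than on the second.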
Assuming your tasks are registered in the normal way, here's how to
access a task named "My Task". The task name you use is the same
name you'd use for any of the MAT utilities.
pDir = MAT.PluginMgr.LoadPlugins()
# pDir implements a dictionary, so you can use it as a dictionary
try:
    myTask = pDir["My Task"]
except KeyError:
    pass
# Alternatively, the getTask method catches the error and
# returns None if the task isn't found
myTask = pDir.getTask("My Task")
In the pip module, accessing your task will be somewhat
different. There are two reasons for this. First, the normal means of registering
tasks is not available to you; and second, you may be trying to
load tasks which contain references to jCarafe, and the normal
load will fail because the relevant implementation classes are not
available in the pip module.
Your first decision point is whether you have a record of your
plugins (i.e., tasks). A plugin record is a text file, where each
line is an absolute path of a directory containing a task.xml
file. If you have such a record, and you've set the
MAT_PLUGIN_RECORD runtime
variable to the absolute path of that text file, you can use
LoadPlugins() as shown above. Otherwise, you'll need to
use a different function:
# Let's assume you have two tasks, one in /proj/mytask and one in /proj/myothertask.
pDir = MAT.PluginMgr.LoadPluginsFromDirs(["/proj/mytask", "/proj/myothertask"])
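If you'd rather keep a plugin record, a file in the format described above (one absolute task directory per line) can be generated with a few lines of Python. The helper name write_plugin_record is hypothetical, and the task.xml check is a convenience of this sketch, not a MAT requirement we can confirm beyond the description above.

```python
import os


def write_plugin_record(task_dirs, record_path):
    """Write one absolute task directory per line; return the list written."""
    abs_dirs = []
    for task_dir in task_dirs:
        abs_dir = os.path.abspath(task_dir)
        # Each recorded directory is supposed to contain a task.xml file.
        if not os.path.isfile(os.path.join(abs_dir, "task.xml")):
            raise ValueError("no task.xml in %s" % abs_dir)
        abs_dirs.append(abs_dir)
    with open(record_path, "w") as fp:
        for abs_dir in abs_dirs:
            fp.write(abs_dir + "\n")
    return abs_dirs
```

You would then set the MAT_PLUGIN_RECORD runtime variable to the absolute path of the file you wrote.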
Your next decision is whether any of your tasks refer to the default jCarafe engine. This
engine is not delivered with the pip module, and since the plugin
loader attempts to find all the engine classes when it loads, a
normal load of such a task will fail. This can be tremendously
inconvenient if you want to work with documents for that task but
don't have the full MAT installation available. To work around this,
both LoadPlugins() and LoadPluginsFromDirs()
accept a lazyClassLoad keyword which postpones the
attempt to locate the engine classes until they're actually
accessed:
pDir = MAT.PluginMgr.LoadPluginsFromDirs(["/proj/mytask", "/proj/myothertask"], lazyClassLoad = True)
pDir = MAT.PluginMgr.LoadPlugins(lazyClassLoad = True)
Once the plugin directory has been loaded, you can access the
task in the directory as shown in the case of the full
application.
To read a document from a file, you must instantiate a reader, using the same
reader name you'd use with any of the MAT utilities (e.g.,
"mat-json", "xml-inline"). If you want the document to use your
task's well-formedness information, you can pass the task to the
reader when you instantiate it.
# Here's how you instantiate a reader without a task.
r = MAT.DocumentIO.getDocumentIO("mat-json")
# And here's how you do it with the task retrieved in the previous section.
r = MAT.DocumentIO.getDocumentIO("mat-json", task = myTask)
# Once you have this reader, you can use it to read a document.
# This method may raise MAT.Document.LoadError.
d = r.readFromSource("/path/to/doc")
# You can also read from standard input.
d = r.readFromSource("-")
# Readers have no document-level state; you can use the same reader
# to read an arbitrary number of documents.
d1 = r.readFromSource("/path/to/doc1")
d2 = r.readFromSource("/path/to/doc2")
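Because one reader can be reused, a batch load is a short loop. The sketch below is a hypothetical helper (read_documents is not part of MAT) that keeps going when a document fails to load; it assumes you pass MAT.Document.LoadError as the load_error argument, since that is the error named above.

```python
def read_documents(reader, paths, load_error=Exception):
    """Read each path with the same reader; return (docs, failures).

    docs maps each path to its loaded document; failures maps each path
    that could not be loaded to the error that was raised.
    """
    docs = {}
    failures = {}
    for path in paths:
        try:
            docs[path] = reader.readFromSource(path)
        except load_error as err:
            failures[path] = err
    return docs, failures
```

Collecting the failures rather than stopping at the first one is a design choice of this sketch; for a small number of documents, letting the error propagate may be simpler.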
The getDocumentIO() method also accepts the "encoding" keyword;
this is ignored in the case of a reader like mat-json for which
the encoding is always the same, but is quite useful when reading
documents using the "raw" reader to create an unannotated MAT
document:
r = MAT.DocumentIO.getDocumentIO("raw", encoding = "big5")
If you want to customize the instantiation of a reader, you can
pass the XML reader attributes as keyword arguments to the
instantiator. For instance, if you want to instantiate an xml-inline reader
so that all XML elements in the document are converted to
annotations, rather than just the ones defined in your task, you
can do this:
r = MAT.DocumentIO.getDocumentIO("xml-inline", task = myTask, xml_translate_all = True)
Note that boolean options accept Python True/False as their
values.
Alternatively, you can create a document from scratch, rather
than loading it from a serialized source. You should provide a
document signal (i.e., the text of the document) as a Python
unicode object.
# You can create a document directly, without the task well-formedness information...
d = MAT.Document.AnnotatedDoc(signal = u"This is the document signal.")
# ...or via the task, which will provide the task well-formedness information.
d = myTask.newDocument(signal = u"This is the document signal.")
You can add annotations to your document, either spanned or
spanless (there are separate methods for each).
The createAnnotation() method creates a spanned annotation. The
type of the created object is MAT.Document.Annotation. The first
two arguments are the start and end offsets of the span. MAT
offsets are zero-based character offsets into the signal, and the
end offset is exclusive. So in the signal "This is the document signal", the first word
"This" has a start offset of 0 and an end offset of 4.
# Create a span annotation without any attributes.
spanA = d.createAnnotation(5, 10, "PERSON")
# The start and end offsets are available as instance variables.
startOffset = spanA.start
endOffset = spanA.end
# The label is converted to a MAT.Document._Atype instance,
# which is the same for all annotations with that label in the document.
# The original label is the value of the "lab" attribute of the atype.
originalLabel = spanA.atype.lab
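The offsets in the "This" example above (start 0, end 4) line up with Python string slicing, so one way to compute a span is to search the signal directly. The helper find_span is hypothetical, and the sketch assumes that MAT offsets follow Python string indices for your signal.

```python
def find_span(signal, text, start_at=0):
    """Return (start, end) offsets of the first occurrence of text, or None."""
    start = signal.find(text, start_at)
    if start < 0:
        return None
    # The end offset points just past the last character, as in a slice.
    return start, start + len(text)


signal = u"This is the document signal"
start, end = find_span(signal, u"This")
```

The resulting pair can then be passed as the first two arguments of createAnnotation().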
Spanless annotations are created using the
createSpanlessAnnotation() method. The type of the created object
is MAT.Document.SpanlessAnnotation.
spanlessA = d.createSpanlessAnnotation("EVENT")
originalLabel = spanlessA.atype.lab
The _Atypes are the primary way that the task well-formedness
conditions are implemented. It's not necessary to understand the
details of these objects, and we won't discuss them further here.
A quirk that's important to be aware
of for spanned annotations (although you're extremely unlikely to
encounter it) is that, like
many programming languages, Python usually doesn't
implement true Unicode characters; instead, it essentially
implements the UTF-16 encoding. This means, in normal
circumstances, that if your document signal contains Unicode
characters from above Unicode plane 0 (e.g., musical notation),
your offsets will be wrong. We say "usually" because it's possible
to build Python with 4-byte characters; the value of the Python
variable sys.maxunicode essentially tells you whether this has
happened. Such a Python installation will behave properly for all
Unicode characters.
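A quick way to find out whether this quirk can affect you is to check sys.maxunicode and scan the signal for characters outside plane 0. The function names below are hypothetical. Note that Python 3.3 and later always report sys.maxunicode as 0x10FFFF, so the first check mainly matters on older interpreters.

```python
import sys


def is_wide_unicode_build():
    """True if this interpreter indexes strings by full Unicode code points."""
    return sys.maxunicode > 0xFFFF


def has_non_bmp_characters(signal):
    """True if signal contains any character above Unicode plane 0."""
    for ch in signal:
        code = ord(ch)
        # On a narrow build, characters above plane 0 appear as surrogate pairs.
        if code > 0xFFFF or 0xD800 <= code <= 0xDFFF:
            return True
    return False
```

If is_wide_unicode_build() is false and has_non_bmp_characters(signal) is true, the offsets you compute for that signal deserve a second look.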
All annotations have a copy() method, but this method differs
from the document methods that create the annotations in that the
annotations aren't added to the document when they're copied. So
you have to do this step separately:
# if your document is d.
newAnnot = spanA.copy()
d._addAnnotation(newAnnot)
You can provide attribute-value pairs as a dictionary when you
create an annotation, or add them later. Attributes are handled
identically in spanned and spanless annotations.
# Annotations implement the Python dictionary interface, so you can
# set or access attribute values using it.
spanA["type"] = "norm"
tVal = spanA["type"]
# Or, avoid the possible KeyError if the value is not set.
tVal = spanA.get("type")
# Create a span annotation with attributes.
spanA = d.createAnnotation(5, 10, "PERSON", attrs = {"type": "norm"})
Attribute-value pairs are deemed to be present in an annotation
only if the attribute value is not None. You can erase an
attribute-value pair by setting its value to None:
spanA["type"] = None
Note that due to a long-standing quirk in the Python
implementation, attributes which have been set to None return None
when accessed, but attributes which have never been set raise
KeyError. The system treats these two cases as equivalent. The
simplest way to deal with this quirk is to always use the get()
method when you're not sure the value has been set.
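Since None and "never set" are meant to be equivalent, it can help to funnel every presence check through get(). The two helpers below are hypothetical; they rely only on the dictionary interface (get() and items()) that annotations are described as implementing, so they can be tried out on a plain dict.

```python
def has_attr(annot, name):
    """True only if the attribute is set to something other than None."""
    return annot.get(name) is not None


def present_attrs(annot):
    """Return a plain dict of the attribute-value pairs that are actually present."""
    return dict((attr, val) for attr, val in annot.items() if val is not None)
```

Note that values such as False or 0 still count as present here; only None is treated as absent.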
Because the annotations implement the dictionary interface, you
can retrieve all attribute value pairs and loop over them as
follows:
for attr, val in spanA.items():
    ...
The various attribute values in the MAT document model are
implemented in Python as follows:
MAT type | Python type |
---|---|
string | str, unicode |
integer | int |
float | float |
boolean | bool (i.e., True, False) |
annotation | MAT.Document.Annotation, MAT.Document.SpanlessAnnotation |
Lists and sets of values do not use the normal Python
list and set objects, and trying to use them will raise an error.
Instead, you must use either MAT.Annotation.AttributeValueList or
MAT.Annotation.AttributeValueSet. These special versions of the
Python list and set objects are required to deal with various
bookkeeping issues; for instance, if you use a set object as the
value of an attribute, all the members of the set must be of the
appropriate MAT type, and if you subsequently add an element to
the set, the type restriction must hold for this new element as
well.
MAT.Annotation.AttributeValueList and
MAT.Annotation.AttributeValueSet are specializations of Python
list and set, respectively, and support all the methods for those
basic types.
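To make the bookkeeping idea concrete, here is a toy set that enforces a member type whenever an element is added. It is only a model: TypedSet is a hypothetical class, it takes the member type explicitly, and it raises TypeError, whereas the real MAT classes determine the type from the MAT attribute and raise their own error.

```python
class TypedSet(set):
    """Toy model of a set whose members must all be instances of one type."""

    def __init__(self, member_type, values=()):
        set.__init__(self)
        self.member_type = member_type
        for value in values:
            self.add(value)

    def add(self, value):
        """Add value, refusing anything that isn't a member_type instance."""
        if not isinstance(value, self.member_type):
            raise TypeError("expected %s, got %s" % (
                self.member_type.__name__, type(value).__name__))
        set.add(self, value)
```

This sketch guards only the constructor and add(); a complete version would have to guard every mutating method (update(), the |= operator, and so on), which gives some sense of why a specialized class is needed.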
Here's how you'd add two PERSON mentions as a set of annotations
as the value of the "mentions" attribute of a COREF annotation:
m1 = d.createAnnotation(5, 15, "PERSON")
m2 = d.createAnnotation(100, 117, "PERSON")
mSet = MAT.Annotation.AttributeValueSet([m1, m2])
c1 = d.createSpanlessAnnotation("COREF", attrs = {"mentions": mSet})
# Adding another annotation to the set works fine...
mSet.add(d.createAnnotation(200, 206, "PERSON"))
# ...but adding a string to the set would raise MAT.Annotation.AnnotationError.
mSet.add("this value won't be accepted")
Note, crucially, that the type of the value of an
annotation-valued attribute is a Python annotation instance. When
the document is serialized by the mat-json writer, this annotation
is converted to a string ID, and that's what you'll see when you
inspect a mat-json document; however, that's only for the purposes
of serialization.
MAT documents support an extremely useful utility for retrieving
annotations, the getAnnotations() method.
# Without arguments, getAnnotations() retrieves all the annotations in the document,
# including SEGMENTs, tokens, and zones.
dList = d.getAnnotations()
# You can limit the retrieval to specific labels.
dList = d.getAnnotations(atypes = ["PERSON", "LOCATION"])
# You can limit the retrieval to just the spanned annotations, or just the spanless ones.
dList = d.getAnnotations(spannedOnly = True)
dList = d.getAnnotations(spanlessOnly = True)
# You can order the spanned annotations by their start offset:
dList = d.getAnnotations(atypes = ["PERSON", "LOCATION"], ordered = True)
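A common follow-up is to bucket the retrieved annotations by label. The helper below is a hypothetical sketch that uses only the atype.lab attribute described earlier; it preserves whatever order the input list has.

```python
from collections import defaultdict


def group_by_label(annotations):
    """Map each label (annot.atype.lab) to the list of annotations carrying it."""
    groups = defaultdict(list)
    for annot in annotations:
        groups[annot.atype.lab].append(annot)
    return dict(groups)
```

For example, group_by_label(d.getAnnotations(spannedOnly = True, ordered = True)) would give you, for each label, its spanned annotations in document order, assuming the two keywords can be combined.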
You can remove annotations individually, or by label.
# Remove a single annotation.
d.removeAnnotation(spanA)
# Remove all "PERSON" annotations.
d.removeAnnotations(atypes = ["PERSON"])
# Remove all annotations, including tokens, zones, and SEGMENTs.
d.removeAnnotations()
A special case to keep in mind is that annotations which are the
values of annotation-valued attributes cannot be removed. In other
words, in our coref case above, the following will raise an error:
d.removeAnnotation(m2)
The API does not currently provide a way to force the annotation
to be detached. There are two things you can do in this case. If
you're not removing the annotation which points to the annotation
you're removing (in our case, c1), you must detach the annotation
by hand:
mSet.discard(m2)
d.removeAnnotation(m2)
On the other hand, if you are also removing this "referencing"
annotation, you can remove them together:
d.removeAnnotationGroup([m2, c1])
This operation is intentionally difficult; it's not clear to us
that it should be easy to remove annotations which are referenced
by other annotations.
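If you find yourself detaching by hand often, the two steps can be bundled. The helper below is hypothetical and deliberately modest: it assumes you already know which attribute value sets refer to the annotation, since nothing above describes a way to look those references up.

```python
def detach_and_remove(doc, annot, referencing_sets):
    """Discard annot from each referencing set, then remove it from doc."""
    for value_set in referencing_sets:
        # discard() is a no-op if annot isn't in the set.
        value_set.discard(annot)
    doc.removeAnnotation(annot)
```

In the coref example, detach_and_remove(d, m2, [mSet]) would stand in for the discard() and removeAnnotation() calls shown above.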
You can use the MAT reader/writer
system to write documents much as you can use it to read them. It's less important,
in this case, whether you instantiate the writer using the task or
not, but it's best practice to do the same thing you did when you
read the document.
w = MAT.DocumentIO.getDocumentIO("mat-json", task = myTask)
w.writeToTarget(d, "/path/to/save/document/in")
You can also write to standard output using the "-" target, and
pass XML writer parameters as keyword arguments to the
instantiator.
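Putting a reader and a writer together gives you a simple format converter. The function convert_document is a hypothetical helper; it uses only the readFromSource() and writeToTarget() calls shown above and expects both objects to come from MAT.DocumentIO.getDocumentIO().

```python
def convert_document(reader, writer, source, target):
    """Read source with reader, write it to target with writer, return the doc."""
    doc = reader.readFromSource(source)
    writer.writeToTarget(doc, target)
    return doc
```

For instance, pairing an "xml-inline" reader with a "mat-json" writer, both instantiated with the same task, would convert an inline XML document to MAT JSON.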