It's not too difficult to define your own reader/writer. The file
MAT_PKG_HOME/lib/mat/python/MAT/XMLIO.py provides a good example.
The template looks like this:
from MAT.DocumentIO import declareDocumentIO, DocumentFileIO, SaveError
from MAT.Document import LoadError

class MyIO(DocumentFileIO):

    def deserialize(self, s, annotDoc):
        ....

    def writeToUnicodeString(self, annotDoc):
        ....

declareDocumentIO("my-io", MyIO, True, True)
The arguments to deserialize() are the input data from the file
and an annotated document to populate; see XMLIO.py for an example
of how to populate it. writeToUnicodeString() should return a
Unicode string which serializes the annotated document passed in.
In order to do this, you'll have to familiarize yourself with the
API for manipulating documents and annotations, which is not
documented but reasonably easy to understand from the source code.
Once you do all this, the file type name you assign to the class
via the call to declareDocumentIO() will be globally available.
You can also define command-line arguments which will be accepted by the tools when this file type is used. XMLIO.py also exemplifies this.
Finally, you can register a document convertor, typically a file containing document conversion XML, to apply whenever a document is read in the context of a particular task.
In the remainder of this document, we'll explore creating a reader for a fairly complex document format, the one associated with the brat annotation tool.
The annotation format for brat 1.2, which we digest here, looks like this:
T2 TITLE 1123 1131 Chairman
T4 IDEOLOGY 1147 1157 Republican
T6 PERSON 1132 1143 Lamar Smith
R1 Has-Ideology Arg1:T6 Arg2:T4
R2 Has-Ideology Arg1:T2 Arg2:T4
It also supports events, which have spans (as opposed to
relations, indicated by R elements here, which don't), and
supports string and boolean features on entities (indicated here
with T) and events. An additional challenge with brat is that it's
a standoff representation which does not include the document
signal. (The brat format also supports declaring the annotation
format, but we're going to ignore that here for the moment.)
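To make the layout concrete, here is a minimal sketch of how the entity (T) and relation (R) lines shown above could be taken apart. It is plain Python 3 with no dependency on MAT; the function name parse_brat_line and the tuples it returns are illustrative choices, not part of MAT or brat. It assumes the ID is separated from the rest of the line by a tab, and that the covered text of an entity follows its offsets after a second tab. Events, attributes and equivalences are ignored.

```python
def parse_brat_line(line):
    """Split one brat standoff line into (kind, id, label, payload).

    kind is "entity" for T lines and "relation" for R lines; other
    line types are not handled in this sketch.
    """
    ident, rest = line.strip().split("\t", 1)
    if ident.startswith("T"):
        # Layout: LABEL START END <TAB> covered text
        span_part, text = rest.split("\t", 1)
        label, start, end = span_part.split()
        return ("entity", ident, label, (int(start), int(end), text))
    if ident.startswith("R"):
        # Layout: LABEL Arg1:T6 Arg2:T4
        tokens = rest.split()
        args = dict(tok.split(":", 1) for tok in tokens[1:])
        return ("relation", ident, tokens[0], args)
    raise ValueError("unsupported brat line: %r" % line)


sample = "T6\tPERSON 1132 1143\tLamar Smith\nR1\tHas-Ideology Arg1:T6 Arg2:T4"
for raw in sample.split("\n"):
    print(parse_brat_line(raw))
```

Note that the relation only names its arguments by ID; nothing in the R line says where those entities are in the text, which is why the reader has to keep a table of IDs.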
We're only going to consider the reader here. We'll refer to the
following listing:
  1 class BratIO(DocumentFileIO):
  2
  3     inputArgs = OptionTemplate([OpArgument("signal_file_location", hasArg = True,
  4                                            help = "The directory where the signal files are located. If missing, the directory of the annotated file is assumed. The signal file is assumed to have a .txt extension instead of the .ann extension of the annotation file.")],
  5                                heading = "Options for brat input")
  6
  7     def __init__(self, signal_file_location = None, encoding = None,
  8                  annotation_conf_location = None, write_annotation_conf = False, **kw):
  9         # Ignore the encoding.
 10         DocumentFileIO.__init__(self, encoding = "utf-8", **kw)
 11         self.signalFileLocation = signal_file_location
 12         self.signalFileName = None
 13
 14     def readFromSource(self, source, **kw):
 15         if (type(source) in (str, unicode)) and (source != "-"):
 16             if os.path.splitext(source)[1] != ".ann":
 17                 raise LoadError, "brat annotation files must end with .ann"
 18             if self.signalFileLocation is None:
 19                 self.signalFileLocation = os.path.dirname(source)
 20             self.signalFileName = os.path.basename(source)
 21         if self.signalFileName is None:
 22             raise LoadError, "can't find the signal file"
 23         return DocumentFileIO.readFromSource(self, source, **kw)
 24
 25     def deserialize(self, s, annotDoc):
 26         if self.signalFileLocation is None:
 27             raise LoadError, "Can't figure out where the signal is located"
 28         # OK, now, try to find the signal.
 29         import codecs
 30         fp = codecs.open(os.path.join(self.signalFileLocation, os.path.splitext(self.signalFileName)[0] + ".txt"), "r", "utf8")
 31         newSignal = fp.read()
 32         fp.close()
 33         if annotDoc.signal and (annotDoc.signal != newSignal):
 34             raise LoadError, "signal from brat signal file doesn't match original signal"
 35         annotDoc.signal = newSignal
 36         annHash = {}
 37         annotAttrs = []
 38         stringAttrs = []
 39         boolAttrs = []
 40         equivSets = []
 41         for line in [t.strip() for t in s.split("\n")]:
 42             if (not line) or (line[0] == "#"):
 43                 continue
 44             [t1, tRest] = line.split("\t", 1)
 45             if t1[0] == "T":
 46                 spanReg = tRest.split("\t")[0]
 47                 [lab, startI, endI] = spanReg.split()
 48                 a = annotDoc.createAnnotation(int(startI), int(endI), lab)
 49                 annHash[t1] = a
 50                 a.setID(t1)
 51             elif t1[0] in "RE":
 52                 # Take it apart.
 53                 rToks = tRest.split()
 54                 lab = rToks[0]
 55                 args = [t.split(":") for t in rToks[1:]]
 56                 if t1[0] == "R":
 57                     # Create it.
 58                     a = annotDoc.createSpanlessAnnotation(lab)
 59                     annHash[t1] = a
 60                     a.setID(t1)
 61                     idx = t1
 62                 else:
 63                     # It's an event, and it's defined elsewhere.
 64                     # But we still need to make sure we map the ID.
 65                     # This is because the attrs might refer to it.
 66                     [lab, idx] = lab.split(":")
 67                     annHash[t1] = annHash[idx]
 68                 annotAttrs.append((t1, idx, args))
 69             elif t1[0] in "MA":
 70                 # Allow the case where the value has whitespace in it.
 71                 toks = tRest.split(None, 2)
 72                 if len(toks) == 3:
 73                     stringAttrs.append((toks[1], toks[0], toks[2]))
 74                 else:
 75                     boolAttrs.append((toks[1], toks[0]))
 76             elif t1 == "*":
 77                 # What do I do with equivs? Establish an equiv relation, I
 78                 # suppose, with a single attribute.
 79                 equivSets.append(tRest.split()[1:])
 80         # So now, everyone is created.
 81         for (idx, attrName, val) in stringAttrs:
 82             a = annHash[idx]
 83             a.atype.ensureAttribute(attrName, aType = "string")
 84             a[attrName] = val
 85         for (idx, attrName) in boolAttrs:
 86             a = annHash[idx]
 87             a.atype.ensureAttribute(attrName, aType = "boolean")
 88             a[attrName] = True
 89         # brat can reuse the event triggers, but MAT can't.
 90         eventTriggersSaturated = set()
 91         for (eid, idx, args) in annotAttrs:
 92             a = annHash[idx]
 93             if idx in eventTriggersSaturated:
 94                 newA = annotDoc.createAnnotation(a.start, a.end, a.atype.lab)
 95                 newA.setID(eid)
 96                 for attr, val in zip(a.atype.attr_list, a.attrs):
 97                     if attr._typename_ != "annotation":
 98                         newA[attr.name] = val
 99                 a = newA
100             for [attrName, argIdx] in args:
101                 a.atype.ensureAttribute(attrName, aType = "annotation")
102                 a[attrName] = annHash[argIdx]
103             eventTriggersSaturated.add(idx)
104         if equivSets is not None:
105             atype = annotDoc.findAnnotationType("_Equiv", hasSpan = False)
106             atype.ensureAttribute("annots", aType = "annotation", aggregation = "set")
107             for equivSet in equivSets:
108                 annotDoc.createSpanlessAnnotation("_Equiv", {"annots": AttributeValueSet([annHash[idx] for idx in equivSet])})
The MAT reader infrastructure does not yet provide built-in
support for dealing with external signals. Lines 3 - 35 provide a
pattern for handling this case. You provide a command-line option
for the location of the external signal (and, if necessary, you'd
probably want to add options for the encoding and how to compute
the signal pathname). You must specialize the readFromSource()
method and locate the signal file (note that lines 16 - 17 are
specific to brat, since brat annotation files are expected to
end in .ann). Finally, in
the beginning of the deserialize() method, you must read the
signal (lines 29 - 32), ensure that it doesn't clash with any
existing signal (lines 33 - 34), and set it in the document (line
35).
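The path computation and the consistency check in this pattern can be separated from MAT entirely. The following is a sketch in plain Python 3, not the MAT implementation: signal_path_for and check_signal are hypothetical helper names, and ValueError stands in for MAT's LoadError.

```python
import os


def signal_path_for(annotation_path, signal_dir=None):
    """Return the path of the .txt signal file paired with a .ann file."""
    base, ext = os.path.splitext(os.path.basename(annotation_path))
    if ext != ".ann":
        raise ValueError("brat annotation files must end with .ann")
    if signal_dir is None:
        # Default: the signal sits next to the annotation file.
        signal_dir = os.path.dirname(annotation_path)
    return os.path.join(signal_dir, base + ".txt")


def check_signal(existing_signal, new_signal):
    """Refuse to replace a non-empty signal with a different one."""
    if existing_signal and existing_signal != new_signal:
        raise ValueError("signal file doesn't match original signal")
    return new_signal


print(signal_path_for(os.path.join("corpus", "doc1.ann")))
print(signal_path_for(os.path.join("corpus", "doc1.ann"), "signals"))
```

Keeping these two steps as small functions makes it easier to add the further options mentioned above, such as a configurable encoding or a different rule for deriving the signal file name.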
If the format you're reading allows annotation-valued attributes,
you need to do the deserialization in two steps: first, create
all the annotations, so that every annotation reference has a
target; and second, set the annotation-valued attributes
appropriately. Lines 36 - 75 perform this initial step.
For instance, at line 45, we recognize that the first character
of the element ID is "T", indicating a spanned entity, and so on
line 48, we create a new annotation, using the start and end
character indices in the annotation file. (It just so happens that
the brat offsets are identical to the MAT offsets. In some
formats, the end index might be one less than the MAT end index,
due to how the format is intended to do its counting; in other
formats, the counts may be in bytes instead of characters. So the
offset computation may be considerably more involved than it is
here.) Once we create the annotation, we store it in a dictionary
under its brat ID, and we assign this brat ID to the annotation on
line 50.
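For formats where the offsets do not line up so conveniently, the two adjustments mentioned above might look like the following sketch. It is plain Python 3 and independent of MAT; both function names are hypothetical, and it assumes the target convention is a character offset with the end index one past the last character, which is what the identical brat and MAT offsets suggest.

```python
def inclusive_to_exclusive(start, end_inclusive):
    """Convert a span whose end index names the last character into a
    span whose end index is one past the last character."""
    return start, end_inclusive + 1


def byte_to_char_offset(text, byte_offset, encoding="utf-8"):
    """Convert an offset counted in encoded bytes into one counted in characters."""
    prefix = text.encode(encoding)[:byte_offset]
    # Decoding raises UnicodeDecodeError if byte_offset falls inside
    # a multi-byte character, which signals a corrupt offset.
    return len(prefix.decode(encoding))


signal = "café au lait"
# In UTF-8, "é" occupies two bytes, so byte 5 is the space after "café".
print(byte_to_char_offset(signal, 5))
print(inclusive_to_exclusive(1132, 1142))
```

The byte conversion re-encodes the signal on every call, which is fine for a sketch but would be worth caching for a document with many annotations.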
On lines 51 - 68, we deal with events and relations. In brat,
relations are spanless annotations, so we create such an
annotation on line 58, and record it as we did the entity. (Events
in brat, on the other hand, are links between spanned entities and
arguments, so we don't need to introduce a new annotation for
events.) In both these cases, we postpone setting the
annotation-valued attributes which serve as the arguments, since
we're not guaranteed to have created the annotations they refer
to yet; on line 68, we record them in a list for later processing.
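The reason for postponing can be seen in a small, MAT-independent sketch of the same two-pass shape. Plain dictionaries stand in for annotations here, and load_records is a hypothetical name; the point is only that a reference which appears before its target still resolves, because nothing is looked up until every object exists.

```python
def load_records(records):
    """records: list of (id, label, {attribute name: referenced id}) tuples."""
    by_id = {}
    pending = []
    # Pass 1: create every object and remember its references for later.
    for ident, label, refs in records:
        by_id[ident] = {"id": ident, "label": label, "attrs": {}}
        pending.append((ident, refs))
    # Pass 2: every target now exists, so forward references resolve.
    for ident, refs in pending:
        for attr_name, target_id in refs.items():
            by_id[ident]["attrs"][attr_name] = by_id[target_id]
    return by_id


# R1 refers to T6 and T4 before either has been declared.
docs = load_records([
    ("R1", "Has-Ideology", {"Arg1": "T6", "Arg2": "T4"}),
    ("T6", "PERSON", {}),
    ("T4", "IDEOLOGY", {}),
])
print(docs["R1"]["attrs"]["Arg1"]["label"])
```

A single pass would fail on this input with a KeyError at the first relation, which is the situation the listing avoids.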
On lines 69 - 75, we deal with attributes. Again, we don't add
them to the annotations; we record them for future augmentation.
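The distinction between string and boolean attributes rests on how many fields remain after the attribute ID. A minimal sketch of that test, in plain Python 3, follows; classify_attribute is a hypothetical name, and the attribute names in the usage lines are invented for illustration.

```python
def classify_attribute(rest):
    """rest is the text after the attribute ID and its tab, for example
    "Negation E1" or "Confidence E2 fairly certain"."""
    # Split on whitespace at most twice, so a value containing spaces
    # survives intact as the third piece.
    toks = rest.split(None, 2)
    if len(toks) == 3:
        name, target, value = toks
        return ("string", target, name, value)
    name, target = toks
    # No value present: the attribute's presence means True.
    return ("boolean", target, name, True)


print(classify_attribute("Negation E1"))
print(classify_attribute("Confidence E2 fairly certain"))
```

Limiting the split to two cuts is what allows values with embedded whitespace; an unrestricted split would break "fairly certain" into two tokens.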
Once we reach line 81, we've read all the brat entries, and we're
ready to add the attributes. On lines 81 - 84, we deal with the
string attributes; first, we ensure the attribute exists with the
proper type (line 83), and then we set the attribute (line 84). We
do the same for boolean attributes on lines 85 - 88.
At this point, we're ready to create the annotation-valued
attributes. Lines 93 - 99 deal with a feature of brat that MAT
does not have: because brat events are sets of arguments declared
against a spanned entity defined elsewhere, you can have multiple
events defined against the same spanned entity. Because MAT deals
with these as distinct event annotations, we must create copies
for those entities which have already been "claimed" by an event.
Once we've dealt with that detail, we handle the event attributes
very similarly to the others: we ensure the attribute exists (line
101), and then set the value (line 102), in this case pulling the
annotation from the dictionary we populated when we created the
annotations in the first step.
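The copy-on-reuse logic is easier to follow with the MAT calls stripped away. In the sketch below, which is plain Python 3 and not the MAT implementation, dictionaries stand in for annotations and attach_events is a hypothetical name. The first event declared against a trigger reuses the trigger itself; any later event gets a fresh copy of the span and label, without the earlier event's arguments.

```python
def attach_events(triggers, events):
    """triggers: {trigger id: annotation dict}
    events: list of (event id, trigger id, {argument name: argument id})

    Returns {event id: annotation dict carrying that event's arguments}.
    """
    claimed = set()
    result = {}
    for event_id, trigger_id, args in events:
        annot = triggers[trigger_id]
        if trigger_id in claimed:
            # The trigger already carries another event's arguments,
            # so copy its span and label into a new annotation.
            annot = {"start": annot["start"], "end": annot["end"],
                     "label": annot["label"], "args": {}}
        annot["args"].update(args)
        claimed.add(trigger_id)
        result[event_id] = annot
    return result


triggers = {"T1": {"start": 0, "end": 5, "label": "Attack", "args": {}}}
events = [("E1", "T1", {"Agent": "T2"}), ("E2", "T1", {"Agent": "T3"})]
out = attach_events(triggers, events)
print(out["E1"] is triggers["T1"], out["E2"] is triggers["T1"])
```

Without the copy, the second event's arguments would overwrite the first event's on the shared trigger.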
There are other details of the brat format that we've skipped over in this description; for instance, brat has a notion of entity equivalences which we model as spanless _Equiv annotations. But this overview of an example reader should provide guidance on how to implement these readers.