Readers, writers, and XML handling

All the MAT tools can be configured to use any of an extensible set of readers and writers. Currently, there are three reader/writer types: raw, mat-json, and xml-inline. There is also a writer-only mat-json-v1 type and a reader-only fake-xml-inline type. These types can be passed to tools like MATEngine. You may also find that your task has defined additional readers and writers; consult your task maintainer for details about these.

raw

For reading, a file of this type is treated as all signal. For writing, the signal is extracted from the relevant annotated document. This reader/writer has no additional options. The default encoding for this reader/writer is UTF-8.

It is very important that you know what the encoding of your raw document is, and not just for MAT; any tool that reads raw text documents needs to know. If you're not sure, ask the person who provided the documents to you.
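As a rough illustration of why the encoding matters, the following Python sketch decodes raw bytes with an explicitly stated encoding and fails loudly when the bytes do not fit it. The function name decode_raw is hypothetical and is not part of MAT.

```python
def decode_raw(data: bytes, encoding: str = "utf-8") -> str:
    """Decode raw document bytes, failing loudly if the encoding is wrong."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        raise ValueError(
            f"document is not valid {encoding} (problem at byte offset {err.start})"
        ) from err


# The same text is stored as different bytes under different encodings.
latin1_bytes = "café".encode("latin-1")

# Decoding with the encoding the file was actually written in works.
print(decode_raw(latin1_bytes, "latin-1"))

# Decoding the same bytes with the UTF-8 default raises ValueError.
try:
    decode_raw(latin1_bytes)
except ValueError as err:
    print("rejected:", err)
```

A strict decode of this kind only catches byte sequences that are invalid in the stated encoding; it cannot confirm that a guess is right, which is why asking the provider of the documents remains the reliable route.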

mat-json

This type designates the MAT-specific JSON document format (current version is 2). This reader/writer has no additional options. The only available encoding is UTF-8.

mat-json-v1

This type designates version 1 of the MAT-specific JSON document format. This type is available only as a writer (since mat-json reads both version 1 and version 2). It has no additional options. The only available encoding is UTF-8.

Reading and writing inline XML

In many circumstances, annotators will want to work with inline XML as a serialization of documents containing span annotations, either for compatibility with text editors, or interoperability with other tools or existing reference corpora, or for ease of textual visual inspection. MAT attempts to make this easy to do, but it's important to understand how MAT thinks about the relationship between inline XML and annotations.

You might encounter three types of documents which are said to be "XML":

You can also write valid XML documents using the xml-inline writer.

xml-inline

The XML inline reader accepts valid XML (and only valid XML). The default encoding is UTF-8.

When you use this reader, you may encounter XML elements which don't correspond to any known annotation type in your task. There are three options you have for how to handle this situation:

The unknown elements can be discarded. This is the default behavior. It may or may not be the behavior you want; think carefully and consider your other alternatives.

The unknown elements can be treated as annotations. For instance, you may have a file which contains structural corpus information, e.g., document boundaries such as <DOC>....</DOC>. You might wish these to be preserved as annotations, either via the UI, or on the command line, via the --xml_translate_all option. Although we make this option available, if you really want these annotations, you should declare these elements to be annotation types in your task. (Note, too, that in this particular example, MAT doesn't really "know" how to treat the material within each <DOC>...</DOC> interval as a separate document; so if you work with such corpora extensively, we recommend segmenting these multi-document corpus files into individual document files.)

The underlying signal can be treated as XML, either via the UI, or on the command line, via the --xml_input_is_overlay option. This final option is quite subtle. MAT's view of XML is that, by default, an XML document is a serialization of an annotated document: the elements are really ways of indicating the location of annotations, and the document signal is the unserialized XML character data. So "&amp;" in your XML document is the serialization of the ampersand character - this is what any well-behaved XML parser will give you. There's another way of looking at this, namely that the XML document is a set of annotations overlaid on another XML document. In other words, the signal of the document is what you get when you subtract the XML elements which correspond to annotation types in your task. This latter interpretation corresponds to your final option. So in this latter interpretation, "&amp;" and other entities will be retained in the signal, as will all elements which don't correspond to annotation types. If you choose this final option, you're really committing yourself to the subtracted signal being well-formed XML, so be careful. In general, this final option is seldom, if ever, what you want.
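The difference between the two interpretations of the signal can be sketched in Python. This is only an illustration using the standard library, not MAT's implementation; the sample document, the KNOWN_LABELS set, and both function names are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

XML_DOC = "<DOC>Smith &amp; Jones hired <PERSON>Ann Lee</PERSON> as <i>CEO</i>.</DOC>"
KNOWN_LABELS = {"PERSON"}  # hypothetical annotation labels declared in the task


def serialization_signal(xml_text: str) -> str:
    """Default view: the signal is the parsed XML character data."""
    return "".join(ET.fromstring(xml_text).itertext())


def overlay_signal(xml_text: str, known: set) -> str:
    """Overlay view: subtract only the tags for known labels, keep all else verbatim."""
    if not known:
        return xml_text
    names = "|".join(re.escape(name) for name in sorted(known))
    return re.sub(rf"</?(?:{names})(?:\s[^<>]*)?/?>", "", xml_text)


print(serialization_signal(XML_DOC))
# Smith & Jones hired Ann Lee as CEO.
print(overlay_signal(XML_DOC, KNOWN_LABELS))
# <DOC>Smith &amp; Jones hired Ann Lee as <i>CEO</i>.</DOC>
```

In the first result the entity has been decoded and every element is gone; in the second, the entity and the unknown elements survive, so the signal is itself an XML document.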

Finally, if you find yourself in a situation where the XML element or attribute names are intended to correspond to an annotation label or attribute name in your task, but they don't match exactly because of XML well-formedness requirements (e.g., the XML attribute name is "foo_attr" and your attribute name is "foo attr", containing a space), we recommend that you use this reader with MATTransducer, and use its document conversion capability to map the relevant XML element and attribute names to your annotation labels and attribute names.

The reader implements these alternatives via the following options:

Command line option: --xml_input_is_overlay
XML attribute: xml_input_is_overlay
Value: "yes" (XML)
Description: Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, the input XML will be treated as a mix of task-relevant annotations and underlying XML, and the extracted signal will be a well-formed XML file. Ignored if --xml_translate_all is specified.

Command line option: --xml_translate_all
XML attribute: xml_translate_all
Value: "yes" (XML)
Description: Normally, the XML reader will digest elements with the same name as a known annotation in the given task, and discard all other XML markup. If this flag is specified, all elements will be converted to annotations; if a task is provided and the element label doesn't correspond to an annotation label, a new annotation type will be constructed. If no task is provided (MATScore, MATReport and MATTransducer all can be used without tasks), the reader will set this flag internally.
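To make the contrast between discarding unknown elements and translating all of them concrete, here is a minimal Python sketch built on the standard library parser. It is not MAT's reader; the function read_inline, its translate_all parameter, and the sample labels are hypothetical stand-ins for the behavior described above.

```python
import xml.etree.ElementTree as ET


def read_inline(xml_text, known_labels, translate_all=False):
    """Return (signal, annotations); each annotation is (label, start, end)."""
    parts, annotations = [], []
    offset = 0  # length of the signal collected so far

    def visit(elem):
        nonlocal offset
        start = offset
        if elem.text:
            parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            visit(child)
            if child.tail:
                parts.append(child.tail)
                offset += len(child.tail)
        # Unknown elements are dropped unless every element is being translated.
        if translate_all or elem.tag in known_labels:
            annotations.append((elem.tag, start, offset))

    visit(ET.fromstring(xml_text))
    return "".join(parts), annotations


doc = "<DOC><PERSON>Ann Lee</PERSON> met <ORG>Acme</ORG>.</DOC>"

print(read_inline(doc, {"PERSON"}))
# ('Ann Lee met Acme.', [('PERSON', 0, 7)])
print(read_inline(doc, {"PERSON"}, translate_all=True))
# ('Ann Lee met Acme.', [('PERSON', 0, 7), ('ORG', 12, 16), ('DOC', 0, 17)])
```

The signal is the same in both cases; only the set of elements that become annotations changes.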

The writer accepts the following options:

Command line option: --signal_is_xml
XML attribute: signal_is_xml
Value: "yes" (XML)
Description: Normally, the XML writer assumes that the underlying signal is not XML. If this flag is present, the underlying signal will be treated as a well-formed XML file when the output file is rendered. If the input file type is also 'xml-inline', use the --xml_input_is_overlay flag to control this setting instead.

Command line option: --xml_output_tag_exclusions <tag,tag,...>
XML attribute: xml_output_tag_exclusions
Value: a comma-delimited list of annotation labels
Description: The annotation labels to exclude from the XML output.

Command line option: --xml_output_exclude_metadata
XML attribute: xml_output_exclude_metadata
Value: "yes" (XML)
Description: Normally, the XML writer saves the document metadata inside an XML comment, so it can be read back in by the XML reader, and also renders the annotation and attribute type information as zero-length XML tags. This flag causes this metadata not to be written.

Every attempt is made to make XML read/write lossless with respect to the underlying document. However, this is not always possible, because MAT documents use standoff annotations, and any crossing dependencies will end up generating malformed XML (e.g., <a>text<b>text</a>text</b>). You can use the --xml_output_tag_exclusions option to discard the offending annotation types.
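One way to find out which annotation types would cross before writing inline XML is to check the standoff spans directly. The sketch below is a hypothetical helper, not part of MAT; it assumes each annotation is available as a (label, start, end) tuple with an exclusive end offset.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive


def crossing_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return pairs of spans that partially overlap, so neither nests in the other."""
    crossings = []
    # Sort by start; for equal starts, put the longer span first so it encloses the shorter.
    ordered = sorted(spans, key=lambda s: (s[1], -s[2]))
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # b starts strictly inside a and ends strictly after a ends.
            if a[1] < b[1] < a[2] < b[2]:
                crossings.append((a, b))
    return crossings


# The spans behind <a>text<b>text</a>text</b> over the signal "texttexttext".
print(crossing_pairs([("a", 0, 8), ("b", 4, 12)]))
# [(('a', 0, 8), ('b', 4, 12))]

# Properly nested spans serialize cleanly, so nothing is reported.
print(crossing_pairs([("a", 0, 12), ("b", 4, 8)]))
# []
```

The labels that appear in the reported pairs are candidates for the --xml_output_tag_exclusions option.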

When used as a writer, xml-inline will dump the annotation and attribute type information (unless --xml_output_exclude_metadata is used). This type information enables all attribute types to be read correctly when xml-inline is used as a reader, whether or not the same annotation task is used. This includes set and list types and annotation-valued attributes. When xml-inline is used as a reader, it looks for the appropriate representation of these types, and if you've provided a task, you can interpret these values correctly even if the document was not produced with the MAT xml-inline writer. We document these values here for completeness; you're welcome to try writing such a document with another tool and seeing if MAT can read it, but we're not guaranteeing that it will work.

Note: the xml-inline writer makes no attempt to modify the annotation labels or attribute names to ensure that the result is well-formed XML. E.g., if an attribute name contains a space, the resulting XML will preserve the space, resulting in ill-formed XML. If you need to create inline XML in this case, we recommend you save your document as MAT-JSON and use MATTransducer along with its document conversion capability to map these labels and attributes to well-formed XML element and attribute names.
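Before writing inline XML, it can help to check which labels and attribute names would break well-formedness. The following Python sketch is an assumption-laden illustration, not a MAT facility: the function names are hypothetical, and the pattern covers only an ASCII subset of the names that XML permits. The actual renaming in MAT is done through MATTransducer's document conversion.

```python
import re

# ASCII-only approximation of an XML name (the XML spec allows more characters).
XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_xml_name(name: str) -> bool:
    """True if the whole string fits the ASCII name pattern above."""
    return XML_NAME.fullmatch(name) is not None


def to_xml_name(name: str) -> str:
    """Suggest a replacement name by substituting underscores for bad characters."""
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned  # names may not start with a digit, "." or "-"
    return cleaned


for name in ["PERSON", "foo attr", "2nd_mention"]:
    print(name, is_xml_name(name), to_xml_name(name))
# PERSON True PERSON
# foo attr False foo_attr
# 2nd_mention False _2nd_mention
```

Two distinct names can map to the same suggestion (for example "foo attr" and "foo_attr"), so any mapping produced this way should be checked for collisions.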

xml-inline in the MAT UI

The xml-inline reader/writer is available as an option in the MAT UI when you load and save documents in file mode.

When you select "xml-inline" as your load option in the MAT UI, the "Load document" dialog looks like this:

[Load dialog]

The menu contains the three options described above. The link will bring you to this page, in a new window or tab.

When you select "xml-inline" from the "Save" menu in your document window, you'll see the following popup:

[Save popup]

The "Underlying signal is XML" checkbox corresponds to the --signal_is_xml option; the "Annotation types to exclude" typein window corresponds to the --xml_output_tag_exclusions option; and the "Exclude MAT document metadata" checkbox corresponds to the --xml_output_exclude_metadata option. The link will bring you to this page, in a new window or tab.

fake-xml-inline

We commonly encounter data which is XML-"like", which simply has inline SGML-ish markup in a raw document, like so:

The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> has announced its IPO.

In these documents, XML-significant characters "&<>" are not properly escaped, and there is no toplevel XML tag surrounding the entire document. The fake-xml-inline reader searches for patterns of the form <...>, and determines whether each "tag" is an opening, closing, or zero-length tag. It translates attribute-value pairs of the SGML-ish opening tags into annotation attribute-value pairs. If it finds an attribute-value string which can't be parsed using XML-ish rules, it will treat the enclosing "tag" as part of the signal. The reader recognizes nested "tag"s correctly. Because this reader is targeted at extracting XML-ish markup from a raw document, there's no option equivalent to --xml_input_is_overlay, and unlike the xml-inline reader, all annotations are converted, not just the known ones.
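The general approach can be sketched in Python with regular expressions. This is a simplified illustration under stated assumptions, not the fake-xml-inline reader itself: the patterns, the function read_fake_xml, and the choice to leave unmatched closing tags in the signal are all decisions made for the example.

```python
import re

# A tag-like pattern: optional "/", a name, parseable attr="value" pairs, optional "/".
TAG = re.compile(
    r"""<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(/?)>"""
)
ATTR = re.compile(r"""([\w.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def read_fake_xml(text):
    """Return (signal, annotations); each annotation is (label, start, end, attrs)."""
    signal, annotations, open_tags = [], [], []
    pos = 0     # how far into the input has been consumed
    length = 0  # length of the signal built so far
    for m in TAG.finditer(text):
        closing, label, attr_text, empty = m.groups()
        if closing and (attr_text or empty or not open_tags
                        or open_tags[-1][0] != label):
            continue  # not a usable closing tag: leave it in the signal
        chunk = text[pos:m.start()]
        signal.append(chunk)
        length += len(chunk)
        pos = m.end()
        if closing:
            _, start, attrs = open_tags.pop()
            annotations.append((label, start, length, attrs))
        else:
            attrs = {name: dq or sq for name, dq, sq in ATTR.findall(attr_text)}
            if empty:
                annotations.append((label, length, length, attrs))  # zero-length tag
            else:
                open_tags.append((label, length, attrs))
    signal.append(text[pos:])
    return "".join(signal), annotations


text = "The <ORGANIZATION>Smith & Jones Corporation</ORGANIZATION> has announced its IPO."
signal, annotations = read_fake_xml(text)
print(signal)
# The Smith & Jones Corporation has announced its IPO.
print(annotations)
# [('ORGANIZATION', 4, 29, {})]
```

Because the patterns require a name immediately after "<" and quoted attribute values, text such as "a < b" or a "tag" with an unparseable attribute string stays in the signal. Opening tags that are never closed are removed from the signal without producing an annotation, which a real reader would need to handle more carefully.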


As with the xml-inline reader, if you find yourself in a situation where the XML-ish element or attribute names are intended to correspond to an annotation label or attribute name in your task, but they don't match exactly because of XML-inspired well-formedness requirements (e.g., the XML attribute name is "foo_attr" and your attribute name is "foo attr", containing a space), we recommend that you use this reader with MATTransducer, and use its document conversion capability to map the relevant XML-ish element and attribute names to your annotation labels and attribute names.

The default encoding for this reader is UTF-8. There is no corresponding writer.

Defining your own reader/writer

If you have an idiosyncratic document format you want to use, it's not too difficult to define your own reader/writer.