Document conversion XML reference

The MAT transducer supports a sophisticated conversion language covering a wide range of operations on annotations: changing labels, demoting and promoting labels, extracting spans, changing attribute values, etc. Using this language, you can transform annotations in many ways without writing any code.

Overview

The general format of the XML is

<instructions>
  <...selector ...>
    <...operation.../>
    <...selector ...>
      ...
    </...selector...>
  </...selector...>
  <...operation.../>
  ...
</instructions>

The selectors or operations are processed in order, and each is applied to the current state of the annotations. You can have as many selectors or operations as you want. For instance, let's say you want to change the name of the PER annotation to PERSON, and the ORG annotation to ORGANIZATION, and you want to rename the nomtype attribute on all of them to NOMTYPE:

<instructions>
  <labels source="PER">
    <map target="PERSON"/>
  </labels>
  <labels source="ORG">
    <map target="ORGANIZATION"/>
  </labels>
  <labels source_re="ORGANIZATION|PERSON">
    <map_attr source="nomtype" target="NOMTYPE"/>
  </labels>
</instructions>

Selectors and selector scope

There are three selectors: <labels>, which establishes an annotation scope; <attrs>, which occurs only within the <labels> scope and establishes an attribute scope; and <values>, which occurs only within the <attrs> scope and establishes an attribute value scope. Within the annotation scope, the default object being modified is the annotation; within the attribute scope, it is the attribute label; and within the value scope, it is the attribute value.

Here's an index of all the selectors and their operators:

     <labels>
       <discard>
       <discard_failed>
       <make_spanless>
       <of_attr>
       <with_attrs>
       <demote>
       <apply>
       <discard_if_null>
       <touch>
       <untouch>
       <force_id>
       <map>
       <map_attr>
       <promote_attr>
       <discard_attrs>
       <split_attr>
       <join_attrs>
       <set_attr>

     <attrs>
       <promote>
       <discard>
       <split>
       <discard_annot_if_null>
       <discard_annot>
       <map>

     <values>
       <promote>
       <discard>
       <map>

Global scope

Within the <instructions> element, there are only two elements: the <discard_untouched> operator and the <labels> selector.

<labels>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| source_re | a Python regular expression | no | A regular expression which matches a full true label. Ignored if source is specified. |
| source | a string | no | The name of a true label |
| excluding_re | a Python regular expression | no | A regular expression which matches a full true label. Ignored if excluding is specified. |
| excluding | a string | no | The name of a true label |

This selector identifies particular annotations for further processing, and establishes an annotation scope. The annotation's true label must match source or source_re (if specified) and not match excluding or excluding_re (if specified). With no attributes specified, this selector matches all annotations.
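As a minimal sketch of how these attributes combine, the selector below matches every annotation whose true label begins with ENAMEX_, except those labeled ENAMEX_OTHER, and discards the matches. The label names are hypothetical and chosen only for illustration.

```xml
<instructions>
  <!-- ENAMEX_.* and ENAMEX_OTHER are hypothetical label names. -->
  <!-- source_re must match the full label; excluding removes one exact label. -->
  <labels source_re="ENAMEX_.*" excluding="ENAMEX_OTHER">
    <discard/>
  </labels>
</instructions>
```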

<discard_untouched>

This operator deletes all annotations which have not been "touched" (i.e., modified by another operator).
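One way this might be used is to keep only a known set of labels: touch the annotations you want to retain, then discard everything else. The sketch below assumes hypothetical PER and ORG labels and relies on the <touch> operator described in the annotation scope.

```xml
<instructions>
  <!-- Mark the annotations to keep; PER and ORG are hypothetical labels. -->
  <labels source_re="PER|ORG">
    <touch/>
  </labels>
  <!-- Delete every annotation that no earlier operator has touched. -->
  <discard_untouched/>
</instructions>
```

Because selectors and operations are processed in order, <discard_untouched> should come after the operators whose results you want to keep.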

Annotation scope

Within the annotation scope, there is one selector, <attrs>; two operators which further restrict the annotations processed, <with_attrs> and <of_attr>; and numerous operations.

<attrs>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| source_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if source is specified. |
| source | a string | no | The name of an attribute |
| excluding_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if excluding is specified. |
| excluding | a string | no | The name of an attribute |

This selector identifies particular attributes for further processing, and establishes an attribute scope. The attribute name must match source or source_re (if specified) and not match excluding or excluding_re (if specified). With no attributes specified, this selector matches all attributes of the annotation.
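For illustration, the following sketch selects a family of attributes by pattern while exempting one of them. The attribute names (tmp_score, tmp_keep, and so on) are hypothetical.

```xml
<labels source="PERSON">
  <!-- Select every attribute whose full name matches tmp_.* except tmp_keep. -->
  <!-- All attribute names here are hypothetical. -->
  <attrs source_re="tmp_.*" excluding="tmp_keep">
    <discard/>
  </attrs>
</labels>
```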

<of_attr>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attr_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if attr is specified. |
| attr | a string | no | The name of an attribute |
| label_re | a Python regular expression | no | A regular expression which matches a true label name. Ignored if label is specified. |
| label | a string | no | A true label |

This operator further restricts the annotations specified in <labels>. An annotation which satisfies this restriction must be the value of the specified attr within an annotation of the specified label. For example:

<labels source="PERSON">
  <of_attr attr_re="arg[12]" label="LOCATED"/>
  ...
</labels>

selects those PERSON annotations which are the value of either the arg1 or arg2 attribute of a LOCATED annotation.

If repeated, any annotation which matches one of the <of_attr> declarations satisfies the restriction.
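A sketch of the repeated form, assuming a hypothetical EMPLOYED_BY annotation with a hypothetical employee attribute alongside the LOCATED example above:

```xml
<labels source="PERSON">
  <!-- A PERSON is selected if it satisfies EITHER declaration. -->
  <of_attr attr="arg1" label="LOCATED"/>
  <!-- EMPLOYED_BY and employee are hypothetical names. -->
  <of_attr attr="employee" label="EMPLOYED_BY"/>
  <touch/>
</labels>
```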

<with_attrs>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| <attr> | a string | no | The <with_attrs> element supports arbitrary attribute-value pairs |

This operator further restricts the annotations specified in <labels>. Each attribute-value pair is matched to the annotation under consideration, and only those annotations which match all pairs satisfy the restriction. If the attribute is of type string, int, or float, the value specified should be the expected value; if the attribute is of type boolean, the recognized values are "yes" and "no"; and if the attribute is of type annotation, the value is the label of the attribute value. For example, if the LOCATED annotation has a boolean attribute "verbal", a string attribute "tense", and an annotation attribute "arg1", this selector and operator:

<labels source="LOCATED">
  <with_attrs verbal="yes" tense="PAST" arg1="PERSON"/>
  ...
</labels>

selects those LOCATED annotations whose verbal value is true, whose tense value is "PAST", and whose arg1 value is an annotation with the "PERSON" label.

If repeated, any annotation which matches one of the <with_attrs> declarations satisfies the restriction.
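Put differently, the pairs inside one <with_attrs> are combined with "and", while separate <with_attrs> elements are combined with "or". A sketch using the LOCATED attributes from the example above:

```xml
<labels source="LOCATED">
  <!-- Matches LOCATED annotations that are verbal AND past tense... -->
  <with_attrs verbal="yes" tense="PAST"/>
  <!-- ...OR that are non-verbal AND whose arg1 is a PERSON annotation. -->
  <with_attrs verbal="no" arg1="PERSON"/>
  <touch/>
</labels>
```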

<discard>

This operator discards the selected annotations.

<discard_failed>

This operator marks all annotations with the selected labels (not just the selected annotations) to be deleted if some restriction fails at the end of the conversion (e.g., if the task into which the annotation is read imposes an attribute restriction which the conversion hasn't satisfied).

<make_spanless>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| demoted_label | a string | no | A label name |
| demoted_attr | a string | no | An attribute name |

This operator applies to all selected annotations which are spanned, and makes them spanless. You can use the optional attributes to specify an attribute and label to demote the span to; i.e., the following specification:

<labels source="LOCATED">
  <make_spanless demoted_label="span" demoted_attr="extent"/>
</labels>

converts all LOCATED annotations to spanless annotations. It introduces a new spanned annotation type "span", creates an instance of it with the span of each original LOCATED annotation, and stores that instance in the "extent" attribute of the LOCATED annotation.

<demote>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| target_attr | a string | yes | An attribute name |
| target_label | a string | yes | A label name |

This operator takes the label of the current annotation, and makes it the value of the attribute specified in target_attr, and makes the label of the annotation the label specified in target_label. For instance:

<labels source="PERSON">
  <demote target_attr="TYPE" target_label="ENAMEX"/>
</labels>

converts PERSON annotations to ENAMEX annotations with the TYPE=PERSON attribute-value pair.

<apply>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| fn | a string | yes | A function name |

This operation applies an arbitrary Python function to the selected annotations. This capability has not been tested, and documenting it further is beyond the scope of this documentation.

<touch>

This operator "touches" the selected annotations, and blocks them from being discarded by the global <discard_untouched> operator.

<untouch>

This operator "untouches" the selected annotations, and makes them eligible to be discarded by the global <discard_untouched> operator.

<force_id>

This operator forces the selected annotations to have an ID.

<discard_if_null>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attrs | a comma-separated string | yes | a sequence of attribute names |

This operator discards the selected annotations if the named attributes all have null values.
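For example, a sketch that removes LOCATED annotations with neither argument filled in. The arg1 and arg2 attribute names follow the earlier LOCATED examples; writing the list without spaces after the comma is an assumption, since the exact separator handling is not specified here.

```xml
<labels source="LOCATED">
  <!-- Discard a LOCATED annotation only if BOTH arg1 and arg2 are null. -->
  <discard_if_null attrs="arg1,arg2"/>
</labels>
```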

<map>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| target | a string | yes | A label name |

This operator changes the name of the selected annotations to the name specified in the target.

<map_attr>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| source | a string | yes | An attribute name |
| target_aggregation | "singleton", "set", or "list" | no | An aggregation |
| target | a string | no | An attribute name |
| target_type | "int", "float", "string", "boolean" | no | A type name, other than "annotation" |

This operator converts the attribute named in source. You can convert the aggregation (from singleton to set, for instance), or the type (from int to string, or string to int), or map the name to the specified target, or some combination of these actions.
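The sketch below shows two of these combinations, assuming hypothetical age and alias attributes on PERSON annotations:

```xml
<labels source="PERSON">
  <!-- Rename the hypothetical "age" attribute to AGE and convert it to int. -->
  <map_attr source="age" target="AGE" target_type="int"/>
  <!-- Keep the hypothetical "alias" attribute's name, but make it a set. -->
  <map_attr source="alias" target_aggregation="set"/>
</labels>
```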

<promote_attr>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| source | a string | yes | The name of an attribute |

This operator changes the label of the selected annotations to the value of the attribute specified by the source, and removes the attribute. For example:

<labels source="ENAMEX">
  <promote_attr source="TYPE"/>
</labels>

relabels each ENAMEX annotation with the value of its TYPE attribute and removes that attribute; those with TYPE=PER become PER annotations, etc.

<discard_attrs>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attrs | a comma-separated string | yes | a sequence of attribute names |

This operator deletes the specified attributes for the selected annotations.

<split_attr>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attr | a string | yes | An attribute name |
| target_attrs | a comma-separated string | yes | a sequence of attribute names |

This operator takes the value of the attribute specified in attr and splits it among the attributes specified in target_attrs. This is useful if, e.g., the original attribute is a list aggregation, and you need those values in separate attributes.
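As a sketch, suppose LOCATED annotations carry a hypothetical list-valued attribute args that should become separate arg1 and arg2 attributes. How the values are distributed among the targets (presumably in order) is an assumption; it is not spelled out here.

```xml
<labels source="LOCATED">
  <!-- "args" is a hypothetical list attribute; arg1 and arg2 receive its values. -->
  <split_attr attr="args" target_attrs="arg1,arg2"/>
</labels>
```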

<join_attrs>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| target_aggregation | "list" or "set" | yes | an aggregation name |
| source_attrs | a comma-separated string | yes | a sequence of attribute names |
| attr | a string | yes | an attribute name |

This operator takes the values in the attributes specified in source_attrs and unifies them in a single aggregation of the type specified by the target_aggregation, and stores the value in the attribute specified by attr. This is useful if, e.g., the original values are spread among separate attributes, and you need those values in a single list or set aggregation.
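A sketch of the reverse of a split, collecting the arg1 and arg2 attributes of a LOCATED annotation into a hypothetical list attribute named args:

```xml
<labels source="LOCATED">
  <!-- Gather arg1 and arg2 into one list stored in the hypothetical "args" attribute. -->
  <join_attrs source_attrs="arg1,arg2" attr="args" target_aggregation="list"/>
</labels>
```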

<set_attr>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attr | a string | yes | an attribute name |
| value | a string | yes | an attribute value |
| value_aggregation | "list" or "set" | no | an aggregation name |
| value_type | "int", "float", "boolean", or "string" | no | a type name |

This operation sets the attribute specified by attr to the value specified by value, interpreted according to the value_aggregation and value_type provided. The default is a singleton string.
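For instance, the sketch below sets one attribute using the defaults and another with an explicit type. The attribute names origin and priority are hypothetical.

```xml
<labels source="PERSON">
  <!-- Default interpretation: a singleton string value. -->
  <set_attr attr="origin" value="converted"/>
  <!-- Explicit type: the string "1" is interpreted as an int. -->
  <set_attr attr="priority" value="1" value_type="int"/>
</labels>
```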

Attribute scope

Within the attribute scope, there is one selector, <values>; and numerous operators.

<values>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| source_re | a Python regular expression | no | a regular expression matching the full attribute value. Ignored if source is specified. |
| source | a string | no | an attribute value |
| excluding_re | a Python regular expression | no | a regular expression matching the full attribute value. Ignored if excluding is specified. |
| excluding | a string | no | an attribute value |

This selector identifies particular values for further processing, and establishes a value scope. The value, when converted to a string, must match source or source_re (if specified) and not match excluding or excluding_re (if specified). Annotation values cannot be selected; boolean values are converted to "yes" or "no" before being compared. With no attributes specified, this selector matches all values of the selected attributes.
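A sketch of a value selector that uses a regular expression, here choosing ENAMEX annotations whose TYPE is either PER or ORG and promoting those values to labels. The ORG value is assumed for illustration.

```xml
<labels source="ENAMEX">
  <attrs source="TYPE">
    <!-- source_re must match the full value: exactly PER or exactly ORG. -->
    <values source_re="PER|ORG">
      <promote/>
    </values>
  </attrs>
</labels>
```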

<promote>

This operator is the equivalent of the <promote_attr> operator in the annotation scope. In other words, the following two specifications are equivalent:

<labels source="ENAMEX">
  <promote_attr source="TYPE"/>
</labels>

<labels source="ENAMEX">
  <attrs source="TYPE">
    <promote/>
  </attrs>
</labels>

<discard>

This operator is the equivalent of the <discard_attrs> operator in the annotation scope. In other words, the following two specifications are equivalent:

<labels source="PERSON">
  <discard_attrs attrs="nomtype"/>
</labels>

<labels source="PERSON">
  <attrs source="nomtype">
    <discard/>
  </attrs>
</labels>

This operator is sometimes more convenient, since the selection capabilities of the <attrs> selector are more flexible than the attrs attribute of the <discard_attrs> operator.

<split>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| target_attrs | a comma-separated string | yes | a sequence of attribute names |

This operator is essentially equivalent to the <split_attr> operator in the annotation scope.
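Under that reading, the attribute to split comes from the enclosing <attrs> selector rather than from an attr attribute. A sketch, assuming a hypothetical list-valued attribute args on LOCATED annotations:

```xml
<labels source="LOCATED">
  <!-- "args" is a hypothetical attribute; the selector plays the role of attr="args". -->
  <attrs source="args">
    <split target_attrs="arg1,arg2"/>
  </attrs>
</labels>
```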

<discard_annot_if_null>

This operator discards the annotation that bears the selected attributes if the value of all the selected attributes is null.
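As a sketch, the following discards any LOCATED annotation whose arg1 and arg2 attributes (from the earlier examples) are both null, using a pattern instead of an explicit list of names:

```xml
<labels source="LOCATED">
  <!-- Select arg1 and arg2 by pattern. -->
  <attrs source_re="arg[12]">
    <!-- Discard the LOCATED annotation if every selected attribute is null. -->
    <discard_annot_if_null/>
  </attrs>
</labels>
```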

<discard_annot>

This operator discards the annotation that bears the selected attributes.

<map>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| target_aggregation | "singleton", "list", or "set" | no | an aggregation name |
| target | a string | no | an attribute name |
| target_type | "int", "float", "boolean" or "string" | no | a type name |

This operation modifies the selected attributes in the specified way: changing their aggregation, changing their type, changing the attribute name, or some combination of the three.
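A sketch of a type conversion applied to several attributes at once, assuming hypothetical attributes whose names begin with score_:

```xml
<labels source="PERSON">
  <!-- score_.* is a hypothetical naming pattern. -->
  <attrs source_re="score_.*">
    <!-- Convert each selected attribute to float, keeping its name. -->
    <map target_type="float"/>
  </attrs>
</labels>
```

When the selector can match more than one attribute, changing only the type or aggregation avoids renaming several attributes to the same target.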

Value scope

The value scope contains three operators.

<promote>

This operator is like the <promote> operator in attribute scope, but applies to specific values. For instance:

<labels source="ENAMEX">
  <attrs source="TYPE">
    <values source="PER">
      <promote/>
    </values>
  </attrs>
</labels>

promotes only those ENAMEX annotations with TYPE=PER to PER annotations.

<discard>

This operator is like the <discard> operator in attribute scope, but discards only the attribute-value pairs which match the selected values.
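A sketch using the boolean verbal attribute of LOCATED from the <with_attrs> example. Since boolean values are compared as "yes" or "no", this removes the verbal attribute only where it is false and leaves it in place where it is true:

```xml
<labels source="LOCATED">
  <attrs source="verbal">
    <!-- Boolean false is compared as the string "no". -->
    <values source="no">
      <discard/>
    </values>
  </attrs>
</labels>
```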

<map>

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| target_type | "int", "float", "boolean" or "string" | no | a type name |
| target_aggregation | "singleton", "list" or "set" | no | an aggregation name |
| target | a string | no | the name of an attribute |
| target_value | a string | no | an attribute value |

This operator is like the <map> operator in attribute scope, except you also have the option of mapping the value itself to something else (specified by target_value). For instance, if you want to convert ENAMEX TYPE=PER annotations into PERSON annotations, you can do it in one of a number of ways, for instance:

<labels source="ENAMEX">
  <attrs source="TYPE">
    <values source="PER">
      <map target_value="PERSON"/>
      <promote/>
    </values>
  </attrs>
</labels>

<labels source="ENAMEX">
  <with_attrs TYPE="PER"/>
  <promote_attr source="TYPE"/>
  <map target="PERSON"/>
</labels>