Document conversion XML reference

The MAT transducer supports a conversion language that covers a wide range of operations on annotations: changing labels, demoting and promoting labels, extracting spans, changing attribute values, etc. Using this language, you can transform annotations extensively without writing any code.

Overview

The general format of the XML is

<instructions>
  <...selector ...>
    <...operation.../>
    <...selector ...>
      ...
    </...selector...>
  </...selector...>
  <...operation.../>
  ...
</instructions>

The selectors or operations are processed in order, and each is applied to the current state of the annotations. You can have as many selectors or operations as you want. For instance, let's say you want to change the name of the PER annotation to PERSON, and the ORG annotation to ORGANIZATION, and you want to rename the nomtype attribute on all of them to NOMTYPE:

<instructions>
  <labels source="PER">
    <map target="PERSON"/>
  </labels>
  <labels source="ORG">
    <map target="ORGANIZATION"/>
  </labels>
  <labels source_re="ORGANIZATION|PERSON">
    <map_attr source="nomtype" target="NOMTYPE"/>
  </labels>
</instructions>

Selectors and selector scope

There are three selectors: <labels>, which establishes an annotation scope; <attrs>, which occurs only within the <labels> scope and establishes an attribute scope; and <values>, which occurs only within the <attrs> scope and establishes an attribute value scope. Within the annotation scope, the default object being modified is the annotation; within the attribute scope, it is the attribute label; and within the value scope, it is the attribute value.

Here's an index of all the selectors and their operators:

     <labels>
       <discard>
       <discard_failed>
       <make_spanless>
       <of_attr>
       <with_attrs>
       <demote>
       <apply>
       <discard_if_null>
       <touch>
       <untouch>
       <force_id>
       <map>
       <map_attr>
       <promote_attr>
       <discard_attrs>
       <split_attr>
       <join_attrs>
       <set_attr>

     <attrs>
       <promote>
       <discard>
       <split>
       <discard_annot_if_null>
       <discard_annot>
       <map>

     <values>
       <promote>
       <discard>
       <discard_annot>
       <map>

Global scope

Within the <instructions> element, there are only three elements: the <discard_untouched> operator, the <copy_metadata> operator, and the <labels> selector.

<labels>

Attributes

Attribute | Value | Obligatory? | Description
source_re | a Python regular expression | no | A regular expression which matches a full true label. Ignored if source is specified.
source | a string | no | The name of a true label
excluding_re | a Python regular expression | no | A regular expression which matches a full true label. Ignored if excluding is specified.
excluding | a string | no | The name of a true label

This selector identifies particular annotations for further processing, and establishes an annotation scope. The annotation's true label must match source or source_re (if specified) and not match excluding or excluding_re (if specified). With no attributes specified, this selector matches all annotations.
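
As a minimal sketch of combining a positive and a negative restriction, the following selects every annotation whose label matches PER.* except those labeled PER_NOM. The label names are hypothetical, and <touch/> stands in for whatever operations you actually need.

```xml
<instructions>
  <!-- Hypothetical labels: PER, PER_NAM, etc. match; PER_NOM is excluded. -->
  <labels source_re="PER.*" excluding="PER_NOM">
    <touch/>
  </labels>
</instructions>
```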

<discard_untouched>

This operator deletes all annotations which have not been "touched" (i.e., modified by another operator).
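
A minimal sketch of how this interacts with the other operators, assuming a document that contains PER, ORG and other annotations: PER is touched by being renamed, ORG is touched explicitly, and everything else is deleted. Because instructions are processed in order, <discard_untouched> is placed last.

```xml
<instructions>
  <!-- Renaming modifies the annotation, so it counts as touched. -->
  <labels source="PER">
    <map target="PERSON"/>
  </labels>
  <!-- ORG is kept unchanged by touching it explicitly. -->
  <labels source="ORG">
    <touch/>
  </labels>
  <!-- All remaining, untouched annotations are deleted. -->
  <discard_untouched/>
</instructions>
```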

<copy_metadata>

Attributes

Attribute | Value | Obligatory? | Description
keys | a comma-separated string | no | a sequence of metadata keys to copy

Under normal circumstances, the document metadata is not copied when documents are converted, with the exception of one (undocumented) internal key which tracks the completedness of the annotation type sets. If you want to force metadata to be copied, use this element. If the keys attribute is not provided, all keys will be copied. Most users will never, ever need to use this operator.

Annotation scope

Within the annotation scope, there is one selector, <attrs>; two operators which further restrict the annotations processed, <with_attrs> and <of_attr>; and numerous operations.

<attrs>

Attributes

Attribute | Value | Obligatory? | Description
source_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if source is specified.
source | a string | no | The name of an attribute
excluding_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if excluding is specified.
excluding | a string | no | The name of an attribute

This selector identifies particular attributes for further processing, and establishes an attribute scope. The attribute name must match source or source_re (if specified) and not match excluding or excluding_re (if specified). With no attributes specified, this selector matches all attributes of the annotation.

<of_attr>

Attributes

Attribute | Value | Obligatory? | Description
attr_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if attr is specified.
attr | a string | no | The name of an attribute
label_re | a Python regular expression | no | A regular expression which matches a true label name. Ignored if label is specified.
label | a string | no | A true label

This operator further restricts the annotations specified in <labels>. An annotation which satisfies this restriction must be the value of the specified attr within an annotation of the specified label. For example:

<labels source="PERSON">
  <of_attr attr_re="arg[12]" label="LOCATED"/>
  ...
</labels>

selects those PERSON annotations which are the value of either the arg1 or arg2 attribute of a LOCATED annotation.

If repeated, any annotation which matches one of the <of_attr> declarations satisfies the restriction.

<with_attrs>

Attributes

Attribute | Value | Obligatory? | Description
<attr> | a string | no | the <with_attrs> element supports arbitrary attribute-value pairs

This operator further restricts the annotations specified in <labels>. Each attribute-value pair is matched to the annotation under consideration, and only those annotations which match all pairs satisfy the restriction.

Each annotation attribute value is compared to the listed value in a manner that depends on the attribute's type.

For example, if the LOCATED annotation has a boolean attribute "verbal", a string attribute "tense", and an annotation attribute "arg1", this selector and operator:

<labels source="LOCATED">
  <with_attrs verbal="yes" tense="PAST" arg1="PERSON"/>
  ...
</labels>

selects those LOCATED annotations whose verbal value is true, whose tense value is "PAST", and whose arg1 value is an annotation with the "PERSON" label.

If repeated, any annotation which matches one of the <with_attrs> declarations satisfies the restriction.
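
As a sketch of how repetition combines with the all-pairs rule, the following selects LOCATED annotations which either are verbal with PAST tense, or have a PERSON annotation in arg1. It reuses the attributes from the example above; the <discard/> operation is only an illustration.

```xml
<labels source="LOCATED">
  <!-- Alternative 1: both pairs must match. -->
  <with_attrs verbal="yes" tense="PAST"/>
  <!-- Alternative 2: matching this declaration alone is sufficient. -->
  <with_attrs arg1="PERSON"/>
  <discard/>
</labels>
```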

<discard>

This operator discards the selected annotations.

<discard_failed>

This operator marks all annotations with the selected labels (not just the selected annotations) to be deleted if some restriction fails at the end of the conversion (e.g., if the task into which the annotation is read imposes an attribute restriction which the conversion hasn't satisfied).

<make_spanless>

Attributes

Attribute | Value | Obligatory? | Description
demoted_label | a string | no | A label name
demoted_attr | a string | no | An attribute name

This operator applies to all selected annotations which are spanned, and makes them spanless. You can use the optional attributes to specify an attribute and label to demote the span to; i.e., the following specification:

<labels source="LOCATED">
  <make_spanless demoted_label="span" demoted_attr="extent"/>
</labels>

converts all LOCATED annotations to spanless annotations. It introduces a new spanned annotation type "span", creates an instance of it with the span of the original LOCATED annotation, and inserts that instance into the "extent" attribute of the LOCATED annotation. If the parent <labels> selector element specifies a source_re, and the value of the demoted_label attribute has a regexp-style backreference (e.g., "\1"), then the demoted label will be computed by substituting in the appropriate regexp groups in the <labels> source_re.

<demote>

Attributes

Attribute | Value | Obligatory? | Description
target_attr | a string | yes | An attribute name
target_label | a string | yes | A label name

This operator takes the label of the current annotation, and makes it the value of the attribute specified in target_attr, and makes the label of the annotation the label specified in target_label. If the parent <labels> selector element specifies a source_re, and the value of the target_label attribute has a regexp-style backreference (e.g., "\1"), then the target label will be computed by substituting in the appropriate regexp groups in the <labels> source_re.

For instance:

<labels source="PERSON">
  <demote target_attr="TYPE" target_label="ENAMEX"/>
</labels>

converts PERSON annotations to ENAMEX annotations with the TYPE=PERSON attribute-value pair.

<apply>

Attributes

Attribute | Value | Obligatory? | Description
fn | a string | yes | A function name

This operation applies an arbitrary Python function to the selected annotations. This capability has not been tested, and documenting it further is beyond the scope of this documentation.

<touch>

This operator "touches" the selected annotations, and blocks them from being discarded by the global <discard_untouched> operator.

<untouch>

This operator "untouches" the selected annotations, and makes them eligible to be discarded by the global <discard_untouched> operator.

<force_id>

This operator forces the selected annotations to have an ID.

<discard_if_null>

Attributes

Attribute | Value | Obligatory? | Description
attrs | a comma-separated string | yes | a sequence of attribute names

This operator discards the selected annotations if the named attributes all have null values.
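
For illustration, assuming LOCATED annotations carry arg1 and arg2 attributes as in the earlier examples, this sketch discards a LOCATED annotation only when both attributes are null:

```xml
<labels source="LOCATED">
  <!-- Discarded only if arg1 AND arg2 are both null. -->
  <discard_if_null attrs="arg1,arg2"/>
</labels>
```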

<map>

Attributes

Attribute | Value | Obligatory? | Description
target | a string | yes | A label name

This operator changes the name of the selected annotations to the name specified in the target. If the parent <labels> selector element specifies a source_re, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target label will be computed by substituting in the appropriate regexp groups in the <labels> source_re.
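
A sketch of the backreference behavior described above, assuming hypothetical source labels PER_NAM, ORG_NAM and LOC_NAM: the first regexp group of the <labels> source_re becomes the new label, so PER_NAM would be renamed to PER, and so on.

```xml
<!-- Hypothetical labels; \1 refers to the first group in source_re. -->
<labels source_re="(PER|ORG|LOC)_NAM">
  <map target="\1"/>
</labels>
```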

<map_attr>

Attributes

Attribute | Value | Obligatory? | Description
source | a string | no | An attribute name
source_re | a Python regular expression | no | A regular expression describing the matching attrs, to be used when source is not provided
target_aggregation | "singleton", "set", or "list" | no | An aggregation
target | a string | no | An attribute name
target_type | "int", "float", "string", or "boolean" | no | A type name, other than "annotation"

This operator converts the attribute named in source or matched by source_re. You can convert the aggregation (from singleton to set, for instance), or the type (from int to string, or string to int), or map the name to the specified target, or some combination of these actions. If source_re is specified, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target attribute name will be computed by substituting in the appropriate regexp groups from that source_re.
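
A minimal sketch combining a rename with a type conversion, assuming a hypothetical string attribute "age" on PERSON annotations:

```xml
<labels source="PERSON">
  <!-- Hypothetical attribute: rename age to AGE and convert its values to int. -->
  <map_attr source="age" target="AGE" target_type="int"/>
</labels>
```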

<promote_attr>

Attributes

Attribute | Value | Obligatory? | Description
source | a string | yes | The name of an attribute

This operator changes the label of the selected annotations to the value of the attribute specified by the source, and removes the attribute. For example:

<labels source="ENAMEX">
  <promote_attr source="TYPE"/>
</labels>

relabels each ENAMEX annotation with the value of its TYPE attribute and removes that attribute; those with TYPE=PER become PER annotations, etc.

<discard_attrs>

Attributes

Attribute | Value | Obligatory? | Description
attrs | a comma-separated string | no | a sequence of attribute names
attr_re | a Python regular expression | no | a regular expression matching the attribute name if attrs is not specified

This operator deletes the specified attributes for the selected annotations.

<split_attr>

Attributes

Attribute | Value | Obligatory? | Description
attr | a string | yes | An attribute name
target_attrs | a comma-separated string | yes | a sequence of attribute names

This operator takes the value of the attribute specified in attr and splits it among the attributes specified in target_attrs. This is useful if, e.g., the original attribute is a list aggregation, and you need those values in separate attributes.
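
As a sketch, assuming a hypothetical list-valued attribute "args" on LOCATED annotations, the following distributes its elements over arg1 and arg2. That the elements are assigned in list order is an assumption; the text above does not state it.

```xml
<labels source="LOCATED">
  <!-- Hypothetical attribute "args"; its elements are split among arg1 and arg2. -->
  <split_attr attr="args" target_attrs="arg1,arg2"/>
</labels>
```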

<join_attrs>

Attributes

Attribute | Value | Obligatory? | Description
target_aggregation | "list" or "set" | yes | an aggregation name
source_attrs | a comma-separated string | yes | a sequence of attribute names
attr | a string | yes | an attribute name

This operator takes the values in the attributes specified in source_attrs and unifies them in a single aggregation of the type specified by the target_aggregation, and stores the value in the attribute specified by attr. This is useful if, e.g., the original values are spread among separate attributes, and you need those values in a single list or set aggregation.
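
A sketch of the reverse direction, assuming LOCATED annotations with arg1 and arg2 attributes and a hypothetical target attribute "args":

```xml
<labels source="LOCATED">
  <!-- Collect arg1 and arg2 into a single list stored in "args" (hypothetical name). -->
  <join_attrs source_attrs="arg1,arg2" attr="args" target_aggregation="list"/>
</labels>
```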

<set_attr>

Attributes

Attribute | Value | Obligatory? | Description
attr | a string | yes | an attribute name
value | a string | yes | an attribute value
value_aggregation | "list" or "set" | no | an aggregation name
value_type | "int", "float", "boolean", or "string" | no | a type name

This operation sets the attribute specified by attr to the value specified by value, interpreted according to the value_aggregation and value_type provided. The default is a singleton string. If the value_type is "boolean", the recognized values are "yes" and "no".
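
For illustration, with hypothetical attribute names "reviewed" and "origin", the first operation below sets a boolean value and the second relies on the default singleton string:

```xml
<labels source="PERSON">
  <!-- Boolean values are written as "yes" or "no". -->
  <set_attr attr="reviewed" value="yes" value_type="boolean"/>
  <!-- No value_type or value_aggregation: a singleton string. -->
  <set_attr attr="origin" value="converted"/>
</labels>
```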

Attribute scope

Within the attribute scope, there is one selector, <values>; and numerous operators.

<values>

Attributes

Attribute | Value | Obligatory? | Description
source_re | a Python regular expression | no | a regular expression matching the full attribute value. Ignored if source is specified.
source | a string | no | an attribute value
excluding_re | a Python regular expression | no | a regular expression matching the full attribute value. Ignored if excluding is specified.
excluding | a string | no | an attribute value

This selector identifies particular values for further processing, and establishes a value scope. With no values specified, this selector matches all values of the selected attributes.

The comparisons are made in almost the same way that they're made for <with_attrs>.

In addition, for any list or set value, the elements are converted according to the rules above, and then the values are concatenated together with a comma, and the whole expression is bracketed with vertical bars. So an attribute value set { 3, 4, 5 } will be converted to the string "|3,4,5|", and a set of two PERSON annotations will be converted to the string "|PERSON,PERSON|". The user can take advantage of this conversion and its delimiters to match these values using source_re or excluding_re. E.g., the following element will choose annotations with a PERSON annotation in the set which is the value of the "participants" attribute:

<values source_re="[,|]PERSON[,|]">
  ...
</values>

<promote>

This operator is the equivalent of the <promote_attr> operator in the annotation scope. In other words, the following two specifications are equivalent:

<labels source="ENAMEX">
  <promote_attr source="TYPE"/>
</labels>

<labels source="ENAMEX">
  <attrs source="TYPE">
    <promote/>
  </attrs>
</labels>

<discard>

This operator is the equivalent of the <discard_attrs> operator in the annotation scope. In other words, the following two specifications are equivalent:

<labels source="PERSON">
  <discard_attrs attrs="nomtype"/>
</labels>

<labels source="PERSON">
  <attrs source="nomtype">
    <discard/>
  </attrs>
</labels>

This operator is sometimes more convenient, since the selection capabilities of the <attrs> selector are more flexible than the attrs attribute of the <discard_attrs> operator.
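
As a sketch of that extra flexibility, the following discards every attribute whose name matches nom.* except nomtype, which relies on the excluding attribute of <attrs>. The attribute names other than nomtype are hypothetical.

```xml
<labels source="PERSON">
  <!-- Drop all nom* attributes but keep nomtype. -->
  <attrs source_re="nom.*" excluding="nomtype">
    <discard/>
  </attrs>
</labels>
```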

<split>

Attributes

Attribute | Value | Obligatory? | Description
target_attrs | a comma-separated string | yes | a sequence of attribute names

This operator is essentially equivalent to the <split_attr> operator in the annotation scope.

<discard_annot_if_null>

This operator discards the annotation that bears the selected attributes if the value of all the selected attributes is null.

<discard_annot>

This operator discards the annotation that bears the selected attributes.

<map>

Attributes

Attribute | Value | Obligatory? | Description
target_aggregation | "singleton", "list", or "set" | no | an aggregation name
target | a string | no | an attribute name
target_type | "int", "float", "boolean", or "string" | no | a type name

This operation modifies the selected attributes in the specified way: changing their aggregation, changing their type, changing the attribute name, or some combination of the three. If the target_type is "boolean", the recognized values are "yes" and "no". If the parent <attrs> selector element specifies a source_re, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target attribute name will be computed by substituting in the appropriate regexp groups in the <attrs> source_re.
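
A sketch of the backreference case, assuming LOCATED annotations with arg1 and arg2 attributes: the digit captured by the <attrs> source_re is substituted into the target, so arg1 would become ARG1 and arg2 would become ARG2.

```xml
<labels source="LOCATED">
  <!-- \1 refers to the group ([12]) in the attrs source_re. -->
  <attrs source_re="arg([12])">
    <map target="ARG\1"/>
  </attrs>
</labels>
```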

Value scope

The value scope contains four operators.

<promote>

This operator is like the <promote> operator in attribute scope, but applies to specific values. For instance:

<labels source="ENAMEX">
  <attrs source="TYPE">
    <values source="PER">
      <promote/>
    </values>
  </attrs>
</labels>

promotes only those ENAMEX annotations with TYPE=PER to PER annotations.

<discard>

This operator is like the <discard> operator in attribute scope, but discards only the attribute-value pairs which match the selected values.

<discard_annot>

This operator is identical to the <discard_annot> operator in attribute scope.

<map>

Attributes

Attribute | Value | Obligatory? | Description
target_type | "int", "float", "boolean", or "string" | no | a type name
target_aggregation | "singleton", "list", or "set" | no | an aggregation name
target | a string | no | the name of an attribute
target_value | a string | no | an attribute value

This operator is like the <map> operator in attribute scope, except you also have the option of mapping the value itself to something else (specified by target_value). If the target_type is "boolean", the recognized values are "yes" and "no". If the parent <attrs> selector element specifies a source_re, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target attribute name will be computed by substituting in the appropriate regexp groups in the <attrs> source_re. Similarly, if the parent <values> selector element specifies a source_re, and the value of the target_value attribute has a regexp-style backreference (e.g., "\1"), then the target value will be computed by substituting in the appropriate regexp groups in the <values> source_re.
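
A sketch of the value backreference, assuming hypothetical TYPE values PER_NAM, ORG_NAM and LOC_NAM on ENAMEX annotations: each value would be rewritten to the captured prefix, e.g. PER_NAM to PER.

```xml
<labels source="ENAMEX">
  <attrs source="TYPE">
    <!-- Hypothetical values; \1 refers to the group in the values source_re. -->
    <values source_re="(PER|ORG|LOC)_NAM">
      <map target_value="\1"/>
    </values>
  </attrs>
</labels>
```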

When a target_aggregation requests the "singleton" target for a list value, the first element of the list will be retrieved; for a set value, a random element will be retrieved.

For instance, if you want to convert ENAMEX TYPE=PER annotations into PERSON annotations, you can do it in a number of ways, such as either of the following:

<labels source="ENAMEX">
  <attrs source="TYPE">
    <values source="PER">
      <map target_value="PERSON"/>
      <promote/>
    </values>
  </attrs>
</labels>

<labels source="ENAMEX">
  <with_attrs TYPE="PER"/>
  <promote_attr source="TYPE"/>
  <map target="PERSON"/>
</labels>