The MAT transducer supports a
conversion language which covers a wide range of operations on
annotations: changing labels, demoting and promoting labels,
extracting spans, changing attribute values, and so on. Using this
language, you can transform annotations in many ways without
writing any code.
The general format of the XML is
<instructions>
<...selector ...>
<...operation.../>
<...selector ...>
...
</...selector...>
</...selector...>
<...operation.../>
...
</instructions>
The selectors or operations are processed in order, and each is
applied to the current state of the annotations. You can have as
many selectors or operations as you want. For instance, let's say
you want to change the name of the PER annotation to PERSON, and
the ORG annotation to ORGANIZATION, and you want to rename the
nomtype attribute on all of them to NOMTYPE:
<instructions>
<labels source="PER">
<map target="PERSON"/>
</labels>
<labels source="ORG">
<map target="ORGANIZATION"/>
</labels>
<labels source_re="ORGANIZATION|PERSON">
<map_attr source="nomtype" target="NOMTYPE"/>
</labels>
</instructions>
There are three selectors: <labels>, which establishes an
annotation scope; <attrs>, which occurs only within the
<labels> scope and establishes an attribute scope; and
<values>, which occurs only within the <attrs> scope
and establishes an attribute value scope. Within the annotation
scope, the default object being modified is the annotation; within
the attribute scope, the default object being modified is the
attribute label; and within the value scope, the default object
being modified is the attribute value.
Here's an index of all the selectors and their operators:
<labels>
  <discard>
  <discard_failed>
  <make_spanless>
  <of_attr>
  <with_attrs>
  <demote>
  <apply>
  <discard_if_null>
  <touch>
  <untouch>
  <force_id>
  <map>
  <map_attr>
  <promote_attr>
  <discard_attrs>
  <split_attr>
  <join_attrs>
  <set_attr>
  <attrs>
    <promote>
    <discard>
    <split>
    <discard_annot_if_null>
    <discard_annot>
    <map>
    <values>
      <promote>
      <discard>
      <discard_annot>
      <map>
Within the <instructions> element, there are only three elements: the <discard_untouched> operator, the <copy_metadata> operator, and the <labels> selector.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
source_re | a Python regular expression | no | A regular expression which matches a full true label. Ignored if source is specified |
source | a string | no | The name of a true label |
excluding_re | a Python regular expression | no | A regular expression which matches a full true label. Ignored if "excluding" is specified |
excluding | a string | no | The name of a true label |
This selector identifies particular annotations for further
processing, and establishes an annotation scope. The annotation's
true label must match source or source_re (if specified) and not
match excluding or excluding_re (if specified). With no attributes
specified, this selector matches all annotations.
This operator deletes all annotations which have not been
"touched" (i.e., modified by another operator).
Attribute | Value | Obligatory? | Description |
---|---|---|---|
keys | a comma-separated string | no | a sequence of metadata keys to copy |
Under normal circumstances, the document metadata is not copied when documents are converted, except for one (undocumented) internal key which tracks the completeness of the annotation type sets. If you want to force metadata to be copied, use this element. If the keys attribute is not provided, all keys will be copied. Most users will never need this operator.
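As a minimal sketch of the keys attribute, the following instruction set copies two metadata keys and nothing else. The key names "source_file" and "annotator" are hypothetical and are used only for illustration; substitute whatever keys your documents actually carry.

```xml
<instructions>
  <!-- "source_file" and "annotator" are hypothetical metadata keys -->
  <copy_metadata keys="source_file,annotator"/>
</instructions>
```

Omitting the keys attribute would, per the description above, copy all keys.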
Within the annotation scope, there is one selector,
<attrs>; two operators which further restrict the
annotations processed, <with_attrs> and <of_attr>; and
numerous operations.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
source_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if source is specified |
source | a string | no | The name of an attribute |
excluding_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if excluding is specified |
excluding | a string | no | The name of an attribute |
This selector identifies particular attributes for further
processing, and establishes an attribute scope. The attribute name
must match source or source_re (if specified) and not match
excluding or excluding_re (if specified). With no attributes
specified, this selector matches all attributes of the annotation.
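A small sketch of how the exclusion attribute might be used: the selector below picks every attribute of PERSON annotations except nomtype and discards it, using the attribute-scope <discard> operator. The choice of PERSON and nomtype simply reuses names from the earlier example.

```xml
<labels source="PERSON">
  <!-- select every attribute whose name is not "nomtype" -->
  <attrs excluding="nomtype">
    <!-- remove the selected attributes from the annotation -->
    <discard/>
  </attrs>
</labels>
```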
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attr_re | a Python regular expression | no | A regular expression which matches the full attribute name. Ignored if attr is specified. |
attr | a string | no | The name of an attribute |
label_re | a Python regular expression | no | A regular expression which matches a true label name. Ignored if label is specified. |
label | a string | no | A true label |
This operator further restricts the annotations specified in
<labels>. An annotation which satisfies this restriction
must be the value of the specified attr within an annotation of
the specified label. For example:
<labels source="PERSON">
<of_attr attr_re="arg[12]" label="LOCATED"/>
...
</labels>
selects those PERSON annotations which are the value of either
the arg1 or arg2 attribute of a LOCATED annotation.
If repeated, any annotation which matches one of the
<of_attr> declarations satisfies the restriction.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | the <with_attrs> element supports arbitrary attribute-value pairs |
This operator further restricts the annotations specified in
<labels>. Each attribute-value pair is matched to the
annotation under consideration, and only those annotations which
match all pairs satisfy the restriction.
Each annotation attribute value is compared to the listed value as follows:
For example, if the LOCATED annotation has a boolean
attribute "verbal", a string attribute "tense", and an annotation
attribute "arg1", this selector and operator:
<labels source="LOCATED">
<with_attrs verbal="yes" tense="PAST" arg1="PERSON"/>
...
</labels>
selects those LOCATED annotations whose verbal value is true,
whose tense value is "PAST", and whose arg1 value is an annotation
with the "PERSON" label.
If repeated, any annotation which matches one of the
<with_attrs> declarations satisfies the restriction.
This operator discards the selected annotations.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
demoted_label | a string | no | A label name |
demoted_attr | a string | no | An attribute name |
This operator applies to all selected annotations which are
spanned, and makes them spanless. You can use the optional
attributes to specify an attribute and label to demote the span
to; i.e., the following specification:
<labels source="LOCATED">
<make_spanless demoted_label="span" demoted_attr="extent"/>
</labels>
converts all LOCATED annotations to spanless annotations. For each one, it creates an instance of a new spanned annotation "span" with the span of the original LOCATED annotation, and inserts that instance into the "extent" attribute of the LOCATED annotation. If the parent <labels> selector element specifies a source_re, and the value of the demoted_label attribute has a regexp-style backreference (e.g., "\1"), then the demoted label will be computed by substituting in the appropriate regexp groups in the <labels> source_re.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
target_attr | a string | yes | An attribute name |
target_label | a string | yes | A label name |
This operator takes the label of the current annotation, and makes it the value of the attribute specified in target_attr, and makes the label of the annotation the label specified in target_label. If the parent <labels> selector element specifies a source_re, and the value of the target_label attribute has a regexp-style backreference (e.g., "\1"), then the target label will be computed by substituting in the appropriate regexp groups in the <labels> source_re.
For instance:
<labels source="PERSON">
<demote target_attr="TYPE" target_label="ENAMEX"/>
</labels>
converts PERSON annotations to ENAMEX annotations with the
TYPE=PERSON attribute-value pair.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
fn | a string | yes | A function name |
This operation applies an arbitrary Python function to the
selected annotations. This capability has not been tested, and
documenting it further is beyond the scope of this documentation.
This operator "touches" the selected annotations, and blocks them
from being discarded by the global <discard_untouched>
operator.
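The following sketch shows how touching might combine with the global <discard_untouched> operator. It assumes, following the description of <discard_untouched>, that renaming a label counts as a modification; the PER and ORG labels are reused from the earlier example.

```xml
<instructions>
  <!-- renaming PER modifies these annotations, so they count as touched -->
  <labels source="PER">
    <map target="PERSON"/>
  </labels>
  <!-- ORG annotations are kept unchanged, but explicitly protected -->
  <labels source="ORG">
    <touch/>
  </labels>
  <!-- all remaining annotations have not been touched and are deleted -->
  <discard_untouched/>
</instructions>
```

Because selectors and operations are processed in order, <discard_untouched> is placed last here so that it sees the effect of the earlier selectors.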
This operator "untouches" the selected annotations, and makes
them eligible to be discarded by the global
<discard_untouched> operator.
This operator forces the selected annotations to have an ID.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attrs | a comma-separated string | yes | a sequence of attribute names |
This operator discards the selected annotations if the named
attributes all have null values.
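For instance, a sketch reusing the LOCATED annotation and its arg1 and arg2 attributes from the <of_attr> example above:

```xml
<labels source="LOCATED">
  <!-- drop a LOCATED annotation only when both arg1 and arg2 are null -->
  <discard_if_null attrs="arg1,arg2"/>
</labels>
```

A LOCATED annotation with a value for either attribute would be kept, since the operator requires all of the named attributes to be null.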
Attribute | Value | Obligatory? | Description |
---|---|---|---|
target | a string | yes | A label name |
This operator changes the name of the selected annotations to the
name specified in the target. If the parent <labels> selector element specifies a source_re, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target label will be computed by substituting in the appropriate regexp groups in the <labels> source_re.
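As an illustration of the backreference behavior, the sketch below prefixes several labels in one step. The labels PER, ORG and LOC and the "NE_" prefix are assumptions chosen for the example, not names defined by MAT.

```xml
<!-- the parentheses define regexp group 1, which holds the matched label -->
<labels source_re="(PER|ORG|LOC)">
  <!-- "\1" is replaced by group 1, giving NE_PER, NE_ORG or NE_LOC -->
  <map target="NE_\1"/>
</labels>
```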
Attribute | Value | Obligatory? | Description |
---|---|---|---|
source | a string | no | An attribute name |
source_re | a Python regular expression | no | A regular expression describing the matching attrs, to be used when source is not provided |
target_aggregation | "singleton", "set", or "list" | no | An aggregation |
target | a string | no | An attribute name |
target_type | "int", "float", "string", "boolean" | no | A type name, other than "annotation" |
This operator converts the attribute named in source or identified in source_re. You can convert the aggregation (from singleton to set, for instance), or the type (from int to string, or string to int), or map the name to the specified target, or some combination of these actions. If source_re is specified, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target attr will be computed by substituting in the appropriate regexp groups in the <labels> source_re.
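A sketch combining a rename with a type conversion. The "age" attribute is hypothetical, and the example assumes its string values can be read as integers.

```xml
<labels source="PERSON">
  <!-- rename the hypothetical "age" attribute to "AGE" and convert it to int -->
  <map_attr source="age" target="AGE" target_type="int"/>
</labels>
```

Any of target, target_type and target_aggregation can be left out when only some of the three changes are needed.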
Attribute | Value | Obligatory? | Description |
---|---|---|---|
source | a string | yes | The name of an attribute |
This operator changes the label of the selected annotations to
the value of the attribute specified by the source, and removes
the attribute. For example:
<labels source="ENAMEX">
<promote_attr source="TYPE"/>
</labels>
converts all ENAMEX annotations and removes the TYPE attribute;
those with TYPE=PER become PER annotations, etc.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attrs | a comma-separated string | no | a sequence of attribute names |
attr_re | a Python regular expression | no | a regular expression matching the attribute name if attrs is not specified |
This operator deletes the specified attributes for the selected
annotations.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attr | a string | yes | An attribute name |
target_attrs | a comma-separated string | yes | a sequence of attribute names |
This operator takes the value of the attribute specified in attr
and splits it among the attributes specified in target_attrs. This
is useful if, e.g., the original attribute is a list aggregation,
and you need those values in separate attributes.
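A minimal sketch, assuming a hypothetical list-valued attribute "args" on LOCATED annotations whose elements should end up in arg1 and arg2:

```xml
<labels source="LOCATED">
  <!-- distribute the elements of the hypothetical "args" list over arg1 and arg2 -->
  <split_attr attr="args" target_attrs="arg1,arg2"/>
</labels>
```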
Attribute | Value | Obligatory? | Description |
---|---|---|---|
target_aggregation | "list" or "set" | yes | an aggregation name |
source_attrs | a comma-separated string | yes | a sequence of attribute names |
attr | a string | yes | an attribute name |
This operator takes the values in the attributes specified in
source_attrs and unifies them in a single aggregation of the type
specified by the target_aggregation, and stores the value in the
attribute specified by attr. This is useful if, e.g., the original
values are spread among separate attributes, and you need those
values in a single list or set aggregation.
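A minimal sketch of the reverse direction, collecting arg1 and arg2 into one list. The target attribute name "args" is hypothetical.

```xml
<labels source="LOCATED">
  <!-- gather arg1 and arg2 into a list stored in the hypothetical "args" attribute -->
  <join_attrs source_attrs="arg1,arg2" attr="args" target_aggregation="list"/>
</labels>
```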
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attr | a string | yes | an attribute name |
value | a string | yes | an attribute value |
value_aggregation | "list" or "set" | no | an aggregation name |
value_type | "int", "float", "boolean", or "string" | no | a type name |
This operation sets the attribute specified by attr to the value
specified by value, interpreted according to the value_aggregation
and value_type provided. The default is a singleton string. If the value_type is "boolean", the recognized values are "yes" and "no".
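Two short sketches of this operation. The attribute names "reviewed" and "origin" and the value "converted" are hypothetical; the first sets a boolean, the second relies on the default singleton string interpretation.

```xml
<labels source="PERSON">
  <!-- boolean value: "yes" is one of the two recognized values -->
  <set_attr attr="reviewed" value="yes" value_type="boolean"/>
  <!-- no value_type or value_aggregation: stored as a singleton string -->
  <set_attr attr="origin" value="converted"/>
</labels>
```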
Within the attribute scope, there is one selector,
<values>; and numerous operators.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
source_re | a Python regular expression | no | a regular expression matching the full attribute value. Ignored if source is specified. |
source | a string | no | an attribute value |
excluding_re | a Python regular expression | no | a regular expression matching the full attribute value. Ignored if "excluding" is specified |
excluding | a string | no | an attribute value |
This selector identifies particular values for further
processing, and establishes a value scope. With no values specified,
this selector matches all values of the selected attributes.
The comparisons are made in almost the same way that they're made for <with_attrs>:
In addition, for any list or set value, the elements are converted according to the rules above, and then the values are concatenated together with a comma, and the whole expression is bracketed with vertical bars. So an attribute value set { 3, 4, 5 } will be converted to the string "|3,4,5|", and a set of two PERSON annotations will be converted to the string "|PERSON,PERSON|". The user can take advantage of this conversion and its delimiters to match these values using source_re or excluding_re. E.g., the following element will choose annotations with a PERSON annotation in the set which is the value of the "participants" attribute:
<values source_re="[,|]PERSON[,|]">
...
</values>
This operator is the equivalent of the <promote_attr>
operator in the annotation scope. In other words, the following
two specifications are equivalent:
<labels source="ENAMEX">
<promote_attr source="TYPE"/>
</labels>
<labels source="ENAMEX">
<attrs source="TYPE">
<promote/>
</attrs>
</labels>
This operator is the equivalent of the <discard_attrs>
operator in the annotation scope. In other words, the following
two specifications are equivalent:
<labels source="PERSON">
<discard_attrs attrs="nomtype"/>
</labels>
<labels source="PERSON">
<attrs source="nomtype">
<discard/>
</attrs>
</labels>
This operator is sometimes more convenient, since the selection
capabilities of the <attrs> selector are more flexible than
the attrs attribute of the <discard_attrs> operator.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
target_attrs | a comma-separated string | yes | a sequence of attribute names |
This operator is essentially equivalent to the <split_attr>
operator in the annotation scope.
This operator discards the annotation that bears the selected
attributes if the value of all the selected attributes is null.
This operator discards the annotation that bears the selected
attributes.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
target_aggregation | "singleton", "list", or "set" | no | an aggregation name |
target | a string | no | an attribute name |
target_type | "int", "float", "boolean" or "string" | no | a type name |
This operation modifies the selected attributes in the specified way: modifying their aggregation type, modifying their target type, or changing the attribute name, or some combination of the three. If the target_type is "boolean", the recognized values are "yes" and "no". If the parent <attrs> selector element specifies a source_re, and the value of the target attribute has a regexp-style backreference (e.g., "\1"), then the target attr will be computed by substituting in the appropriate regexp groups in the <attrs> source_re.
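As a sketch of the backreference behavior in attribute scope, the following renames arg1 and arg2 to ARG1 and ARG2. The upper-case target names are chosen only for illustration.

```xml
<labels source="LOCATED">
  <!-- group 1 captures the digit in the attribute name -->
  <attrs source_re="arg([12])">
    <!-- "\1" is replaced by that digit, giving ARG1 or ARG2 -->
    <map target="ARG\1"/>
  </attrs>
</labels>
```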
The value scope contains three operators.
This operator is like the <promote> operator in attribute
scope, but applies to specific values. For instance:
<labels source="ENAMEX">
<attrs source="TYPE">
<values source="PER">
<promote/>
</values>
</attrs>
</labels>
promotes only ENAMEX TYPE=PER to PER.
This operator is like the <discard> operator in attribute
scope, but discards only the attribute-value pairs which match
the selected values.
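For example, a sketch which removes the TYPE attribute only where it carries a particular value. The value "UNKNOWN" is hypothetical.

```xml
<labels source="ENAMEX">
  <attrs source="TYPE">
    <!-- only TYPE attributes whose value is the hypothetical "UNKNOWN" are selected -->
    <values source="UNKNOWN">
      <!-- remove those attribute-value pairs; other TYPE values stay in place -->
      <discard/>
    </values>
  </attrs>
</labels>
```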
Attribute | Value | Obligatory? | Description |
---|---|---|---|
target_type | "int", "float", "boolean" or "string" | no | a type name |
target_aggregation | "singleton", "list" or "set" | no | an aggregation name |
target | a string | no | the name of an attribute |
target_value | a string | no | an attribute value |
This operator is like the <map> operator in attribute
scope, except you also have the option of mapping the value itself
to something else (specified by target_value). If the target_type is "boolean", the recognized values are "yes" and "no". If the parent <attrs> selector element specifies a source_re, and
the value of the target attribute has a regexp-style backreference
(e.g., "\1"), then the target attr will be computed by substituting in
the appropriate regexp groups in the <attrs> source_re. Similarly, if the parent <values> selector element specifies a source_re, and
the value of the target_value attribute has a regexp-style backreference
(e.g., "\1"), then the target value will be computed by substituting in
the appropriate regexp groups in the <values> source_re.
When a target_aggregation requests the "singleton" target for a list value, the first element of the list will be retrieved; for a set value, a random element will be retrieved.
For instance, if
you want to convert ENAMEX TYPE=PER annotations into PERSON
annotations, you can do it in one of a number of ways, for
instance:
<labels source="ENAMEX">
<attrs source="TYPE">
<values source="PER">
<map target_value="PERSON"/>
<promote/>
</values>
</attrs>
</labels>
<labels source="ENAMEX">
<with_attrs TYPE="PER"/>
<promote_attr source="TYPE"/>
<map target="PERSON"/>
</labels>