MAT uses a common algorithm for comparing two documents to each
other, and for producing common scoring statistics based on that
comparison. This algorithm underlies the comparison view, the reconciliation process, and MATScore.
MAT's approach to scoring and comparison is extremely flexible
and powerful. It does not cover all possible cases of annotations
you might wish to compare, but it covers a large number of them.
In most situations, you don't need to know much about scoring and
comparison. MAT is configured with reasonably intelligent defaults
for comparing annotations to each other, including annotations
with annotation-valued attributes. If you want to change these
defaults, or further enhance the aggregation or decomposition of
scores in the score spreadsheet, or if you want multiple options
for scoring or similarity, then you should read this document.
One thing to keep in mind is that MAT cannot compare every
possible annotation set. In particular, because of the way MAT is
built, it can't compare annotation sets which contain circular
references to annotation types. Here are a couple of examples:
<annotations>
  <span label="PERSON"/>
  <span label="LOCATION"/>
  <span label="ORGANIZATION"/>
  <span label="RELATION">
    <filler name="arg" filler_types="RELATION"/>
  </span>
</annotations>
<annotations>
  <span label="PERSON"/>
  <span label="LOCATION">
    <filler name="arg" filler_types="RELATION"/>
  </span>
  <span label="ORGANIZATION"/>
  <span label="RELATION">
    <filler name="arg" filler_types="LOCATION"/>
  </span>
</annotations>
In the first example, there's a direct circular reference;
RELATION has an annotation-valued attribute which is restricted to
RELATION. In the second case, the circular reference is indirect.
At the moment, the comparison and scoring algorithm cannot handle
these cases. For more details, see below.
The MAT comparison algorithm uses a set of configurable similarity
profiles to create similarity metrics for the Kuhn-Munkres bipartite
matching algorithm, which it executes over relevant subsets of the
annotations, in a stratified manner that ensures that annotations are
compared in an appropriate order. Here's how it works.
The MAT comparison algorithm begins with a set of similarity
profiles. These are declarative descriptions, for particular labels,
of how those labels should be compared: which dimensions of the
annotation to use, how the dimensions should be compared, and how the
similarity of each dimension contributes to the similarity of the
whole.
The <similarity_profile> element is an immediate child of
the toplevel <task>
element. Here's an example:
<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
So consider a PERSON annotation over characters 10 - 20 with
nomtype = PRO in document 1, vs. an ORGANIZATION annotation over
characters 11 - 20 with nomtype = NAM in document 2. The labels
don't match (so the _label dimension contribution will be 0 of a
possible 2); the nomtype values don't match (so the nomtype
dimension contribution will be 0 of a possible 1), and the spans
overlap in 9 out of a possible 10 characters (so the _span
contribution will be 8 out of a possible 8, since the overlap is
.9 and .8 is the lower bound for what counts as a full match). So
the similarity of these two annotations, given this profile, would
be ((0 * 2) + (1 * 8) + (0 * 1)) / (2 + 8 + 1) = 8/11, or about .73.
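To make the arithmetic concrete, here's a small illustrative sketch in Python (not MAT's actual code; the (score, weight) pairs are hypothetical stand-ins for the dimension methods and their weights):

def similarity(dimensions):
    # dimensions: (method_score, weight) pairs; each method returns 0..1
    total_weight = sum(w for _, w in dimensions)
    return sum(score * w for score, w in dimensions) / total_weight

# _label clashes (0 of weight 2); _span overlap .9 clears the .8 lower
# bound, so the method returns 1 (of weight 8); nomtype clashes (0 of 1)
print(similarity([(0, 2), (1, 8), (0, 1)]))   # 8/11, about .73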
As you can see, you can specify special dimensions, like _span or
_label, or attributes of the annotations listed in the
true_labels. We describe the defaults, and the currently available
methods and dimensions, below.
In various locations in the similarity profiles (and score
profiles, too), you have the opportunity to specify a list of true
labels, as in the <tag_profile> element shown above.
Wherever you can specify true labels, you can also specify the
name of an annotation set, using the prefix "set:". E.g., in the
example above, if the annotation set descriptor named "content"
consisted entirely of the PERSON and ORGANIZATION labels, you
could also write this similarity profile as follows:
<similarity_profile>
  <tag_profile true_labels="set:content">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
This shorthand is available for any value of the "true_labels"
attribute in any element of the similarity or score profiles. It
refers exclusively to the labels mentioned in the
referenced sets; if the referenced annotation set descriptor
defines an attribute which refers to an annotation label defined
in another annotation set descriptor, that label will not
be included in the expanded set of true labels.
Once we establish the similarities, we invoke the Kuhn-Munkres
bipartite matching algorithm (otherwise known as the Hungarian
algorithm). Kuhn-Munkres finds the best pairing between two sets,
given similarity measures among the set elements; the similarity
profiles provide these measures. Once we run Kuhn-Munkres, we have
a pairing between the annotations in the two comparison sets,
along with a similarity measure for each best pair.
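Here's an illustrative sketch of this pairing step in Python, using scipy's implementation of the Hungarian algorithm rather than MAT's internal one (the similarity matrix is a hypothetical stand-in for the profile-derived similarities):

import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i][j] is the similarity between annotation i in one document
# and annotation j in the other
sim = np.array([[0.73, 0.10],
                [0.05, 1.00]])
# maximize=True finds the pairing with the greatest total similarity
rows, cols = linear_sum_assignment(sim, maximize=True)
pairs = [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]
print(pairs)   # [(0, 0, 0.73), (1, 1, 1.0)]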
When we run Kuhn-Munkres, we don't need to compare all the
annotations in one document with all the annotations in the other.
For instance, for spanned annotations, we stipulate that if two
annotations don't overlap at all, they can't form a pair. So we
can subdivide the spanned annotations into candidate overlap sets,
and run Kuhn-Munkres over many small sets instead of one big one.
In addition, if a region of text has annotations in one document
but none in the other, none of those annotations can be paired,
and we don't need to run Kuhn-Munkres at all for that subset of
annotations.
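Because of this, the subdivision can be sketched as a simple sweep over the spans. Here's an illustrative Python version (not MAT's actual code); it pools spans from both documents and groups them into transitively overlapping clusters:

def overlap_clusters(spans):
    # spans: (start, end) pairs from both documents, pooled together
    clusters = []
    for start, end in sorted(spans):
        # a span that begins before the current cluster ends overlaps it
        if clusters and start < clusters[-1][0]:
            clusters[-1][0] = max(clusters[-1][0], end)
            clusters[-1][1].append((start, end))
        else:
            clusters.append([end, [(start, end)]])
    return [members for _, members in clusters]

print(overlap_clusters([(10, 20), (11, 20), (30, 40), (35, 42)]))
# [[(10, 20), (11, 20)], [(30, 40), (35, 42)]]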
If we're trying to compare spanless annotations, there are two
cases. We first attempt to compute an "implied span", which
reaches from the start index of the earliest annotation that this
annotation "points to" (i.e., has as the value of an
annotation-valued attribute) to the end index of the latest
annotation it points to; this computation is performed
recursively. We subdivide all spanless annotations for which we
can compute an implied span in the same way we subdivide spanned
annotations; all other spanless annotations we subdivide by
label.
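The implied-span computation can be sketched as a simple recursion. This is an illustration, not MAT's code; the dictionary shape for annotations is hypothetical:

def implied_span(ann):
    if ann.get("span") is not None:
        return ann["span"]          # spanned: use its own span
    # otherwise, recurse through the annotations this one points to
    spans = [s for s in (implied_span(a) for a in ann.get("targets", []))
             if s is not None]
    if not spans:
        return None                 # no implied span can be computed
    return (min(s for s, _ in spans), max(e for _, e in spans))

person = {"span": (10, 20)}
location = {"span": (45, 52)}
relation = {"span": None, "targets": [person, location]}
print(implied_span(relation))       # (10, 52)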
The presence of annotations which have annotation-valued
attributes (e.g., relations and events) poses a problem for the
simple case. Such annotations imply that what we should be doing
is finding a best graph alignment, rather than a best set
alignment. We've chosen not to deal with graph alignment. Instead,
we've chosen to require that the similarity algorithm be stratified:
all annotations which are "pointed to" must be paired before the
annotations that "point to" them are paired. The output of the
earlier stratum is used as input to the later stratum; by default,
annotation-valued attributes cannot match if their values were not
paired on a previous stratum.
This stratification is enforced on a label-by-label basis, not an
annotation-by-annotation basis. The analysis is static: if an
annotation label is listed in the label restriction of an
annotation-valued attribute in your task, it has to be paired
before the annotation that bears the annotation-valued attribute.
E.g. in:
<span label="PERSON">
<span label="EVENT">
<filler name="arg1" filler_types="PERSON"/>
</span>
the PERSON annotations must be guaranteed to be paired before the
EVENT annotations.
This guarantee can be imposed in one of two ways. First, each
stratum is internally processed so that spanned annotations are
paired before spanless ones; by default, all annotations are in a
single stratum, so if your task contains some spanned and some
spanless annotations, the spanned annotations will be paired
first. Second, the annotations can be separated into strata: if
you provide no strata, they will be computed for you, or you can
define strata explicitly:
<similarity_profile>
  <stratum true_labels="PERSON"/>
  <stratum true_labels="EVENT"/>
  ...
</similarity_profile>
The order of the <stratum> elements determines the stratum
order.
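When the strata are computed for you, one natural way to think of the computation is as a topological sort over label dependencies, where each annotation-valued attribute restriction is an edge. Here's an illustrative sketch using Python's graphlib (again, not MAT's actual implementation):

from graphlib import TopologicalSorter, CycleError

ts = TopologicalSorter()
# EVENT's arg1 is restricted to PERSON, so PERSON must be paired first
ts.add("EVENT", "PERSON")
ts.add("PERSON")
print(list(ts.static_order()))      # ['PERSON', 'EVENT']

# a circular reference makes stratification impossible:
ts2 = TopologicalSorter()
ts2.add("RELATION", "RELATION")     # RELATION's filler is RELATION itself
try:
    list(ts2.static_order())
except CycleError:
    print("cannot stratify: circular reference")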
The stratification requirement allows us to continue to use a set
alignment algorithm, rather than a graph alignment algorithm, but
also leads to a couple of restrictions. First, the overall
annotation pairing isn't a true global best pair, because we can't
use the annotations that "point to" a given annotation as a
candidate comparison dimension. Second, as we noted in the
introduction, annotation descriptions which have cycles can't be
compared:
<span label="PERSON">
<span label="EVENT">
<filler name="arg1" filler_types="PERSON EVENT"/>
</span>
There's nothing incoherent about such an annotation schema; it
just can't be compared or scored using the algorithm in MAT.
Note: the Kuhn-Munkres stratification is completely independent of the partition into annotation sets. This is intentional; we don't believe that these things should be coupled to each other.
Each similarity profile consists of zero or more <stratum>
declarations, and zero or more <tag_profile> declarations.
The labels referenced in the similarity profiles are true, not
effective, labels.
The <stratum> declarations are present to enforce the
stratification requirements of the comparison algorithm. If your
profile contains no <stratum> declarations, the comparison
algorithm will attempt to create a stratum order which respects
the stratification
requirements. If it isn't possible to do this, the
comparison algorithm will raise an error. If you declare at least
one <stratum>, and your strata do not mention all your
content labels, the comparison algorithm may also raise an error.
So if you have to declare strata, be sure to mention all your
content labels. You define a stratum like this:
<similarity_profile>
  <stratum true_labels="label1,label2,..."/>
  ...
</similarity_profile>
Each <tag_profile> element defines the way a set of true labels is
compared. All the labels listed must have the dimensions
specified; if a dimension is an annotation attribute, the
attribute must have the same type and aggregation in each
annotation. Here's our example again:
<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
Here, both PERSON and ORGANIZATION must be spanned annotations
(since the _span dimension is present) and both must have a
"nomtype" attribute, and the type and aggregation of that
attribute must be the same in each (e.g., it can't be a string for
PERSON and a set of integers for ORGANIZATION).
If the annotations have attributes you want to compare, but the
attributes have different names, you can use the
<attr_equivalences> element to address that:
<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <attr_equivalences name="nomtype" equivalences="p_nomtype,o_nomtype"/>
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>
So in this example, if PERSON has the p_nomtype attribute, and
ORGANIZATION has the o_nomtype attribute, you can treat them as
the same attribute for the purposes of comparison. Note that if
you use an attribute equivalence, the dimensions must refer to the
name of the equivalence.
If the algorithm has to compare two annotations which are not in
the same <tag_profile>, it will use only the non-attribute
dimensions for comparison, and use the weight of the attribute
dimensions as "dead weight" (i.e., as if the comparison score for
those dimensions is 0). It will compare annotation A to annotation
B using annotation A's profile, then compare them using
annotation B's profile, and take the smaller of the two results.
In other words, the comparison scores of annotations which are not
in the same <tag_profile> are strongly penalized.
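To make the "dead weight" arithmetic concrete, here's a small illustrative computation (the weights are hypothetical):

def profile_score(shared_score, shared_weight, attr_weight):
    # attribute dimensions score 0, but their weight stays in the denominator
    return shared_score / (shared_weight + attr_weight)

def cross_profile_similarity(shared_score, shared_weight, attr_w_a, attr_w_b):
    # score under each annotation's own profile, keep the smaller
    return min(profile_score(shared_score, shared_weight, attr_w_a),
               profile_score(shared_score, shared_weight, attr_w_b))

# the shared dimensions (_label, _span) contribute 6 of a possible 10;
# profile A's attribute dimensions weigh 1, profile B's weigh 4
print(cross_profile_similarity(6, 10, 1, 4))   # min(6/11, 6/14), about .43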
Each dimension must have a name and a weight. You can also
specify a method; every dimension has a default method, and a
default means of applying the specified method to aggregate types
(lists and sets). At the moment, only set aggregates have an
implemented default comparison method (we haven't had much use for
the list types yet). If you don't specify a method, the default
method will be used. Some methods have optional parameters which
can also be specified in the <dimension> element, as shown in
the example above for _span. Each method returns a value between 0
and 1, and this value is multiplied by the weight to determine the
total contribution to the similarity that this dimension makes.
It's possible to define your own dimension methods, but
describing this process is beyond the scope of this documentation.
The available dimensions are described below.
The _label dimension compares the two annotation labels. If the annotation
has an effective label, that label will be used; if the annotation
label is mentioned in an equivalence class when the scorer is
invoked (see the --equivalence_class option of MATScore), the equivalence class name
will be used. The default _label method is "label_equality". This
method accepts an optional "true_residue" option, which is a float
between 0 and 1; if there are effective label values that don't
match, but the underlying true labels do, this value will be
returned as the similarity. Examples:
<dimension name="_label" weight="2"/>
<dimension name="_label" method="label_equality" weight="2"/>
<dimension name="_label" weight="2" true_residue=".5"/>
The _span dimension compares the two spans. It can only be used
with spanned annotations; it cannot be used to compare the
"implied span" computed when subdividing the input to the
Kuhn-Munkres algorithm. The default
_span method is "overlap". This method returns the fraction of the
combined span of the two annotations in which they overlap; so if
one annotation covers characters 10 - 15, and another annotation
covers characters 12 - 20, the combined span is 10 - 20, and the
overlap is 12 - 15, so the method returns .3.
In general, you'll probably prefer the default behavior, which
causes the span similarity to be proportional to the overlap.
However, if you wish to impose a threshold on the similarity, this
method accepts two optional parameters: "overlap_match_lower_bound",
which specifies an overlap value above which 1.0 will be returned,
and "overlap_mismatch_upper_bound", which specifies an overlap
value below which 0 will be returned. Both of these bounds are
non-inclusive. Examples:
<dimension name="_span" method="overlap" weight="4"/>
<dimension name="_span" weight="8" overlap_match_lower_bound=".8"/>
Every annotation attribute is a dimension. If a dimension name
isn't one of the special cases listed here, it's interpreted as an
attribute (or as an attribute equivalence, if such an equivalence
is declared). If one of your attributes has the same name as one
of the special cases documented here, you won't be able to refer
to it in your similarity description.
For attributes of type other than "annotation", the default
comparison method is called "equality", and simply returns 1 if
the two values are equal, 0 otherwise. If the attribute is a set
aggregation, the value is the ratio of the size of the
intersection of the set values over the size of the union of the
set values.
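In other words, a set-valued attribute comparison reduces to the Jaccard ratio of the two value sets; a minimal sketch (the empty-set convention is an assumption of this sketch):

def set_attribute_similarity(values1, values2):
    v1, v2 = set(values1), set(values2)
    if not v1 and not v2:
        return 1.0   # treat two empty sets as trivially equal
    return len(v1 & v2) / len(v1 | v2)

print(set_attribute_similarity({"NAM", "NOM"}, {"NAM", "PRO"}))  # 1/3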
For attributes of type "annotation", the default comparison
method is called "similarity", and it returns 0 if the annotation
values were not paired in a previous stratum, and the computed
similarity value otherwise. If the attribute is a set aggregation,
the Kuhn-Munkres algorithm is invoked to compare the sets, using
the "similarity" comparison method to provide the similarity
weights, and the overall similarity is the sum of the similarities
found by the Kuhn-Munkres algorithm, divided by the maximum
possible similarity.
Example:
<dimension name="nomtype" weight="1"/>
If a dimension name contains a comma, it's treated as a set of
attributes to be considered together. Such a dimension must
have a method specified (and there will be no aggregate handler
for that method). At the moment, the only known method for this
case is "_annotation_set_similarity". This method is intended to
be used in the case where there are two or more annotation-valued
attributes which are interchangeable, as in the case of a
symmetric relation like "related-to". The implementation of this
method is essentially identical to the method for comparing
annotation-valued attributes which are set aggregations. Example:
<dimension name="arg1,arg2" weight="2" method="_annotation_set_similarity"/>
The _nonannotation_attribute_remainder dimension is used primarily
in the default profiles specified below. It looks at all the
attributes that are not otherwise mentioned as dimensions and are
not annotation-valued, and compares them. If there are no such
attributes, this dimension is ignored; i.e., its weight will be
treated as zero. Otherwise, the similarity is the sum of the
similarities of these otherwise-undeclared attributes divided by
the number of attributes.
The _annotation_attribute_remainder dimension is likewise used
primarily in the default profiles specified below. It looks at all
the annotation-valued attributes that are not otherwise mentioned
as dimensions, collapses all the values for these attributes into
a single set for each annotation (including the members of any set
aggregations), and compares those sets using Kuhn-Munkres. The
idea here is that when comparing relations in a default manner,
the actual names of the arguments aren't nearly as important as
they are in the non-annotation case, and should be ignored.
The weight for this dimension is based on how many of the named relevant attributes in the two annotations could be paired. So, for example, if the names of the relevant attributes are identical in the two annotations being paired, it's possible for this dimension to match perfectly. If, on the other hand, some of the attribute names do not match, the maximum score for the dimension will be limited by the number of non-matching attribute names. The intention here is that this dimension will be able to yield perfect matches for annotations which are clearly intended to match (i.e., they have the same label), and will still increase the similarity of annotations which can't match perfectly, if some of the annotation-valued attributes overlap in their values.
As in the _nonannotation_attribute_remainder case above, if there
are no attributes which meet the requirements, this dimension's
weight will be treated as zero.
This special dimension is a "macro" dimension - it stands for all
remaining unclaimed common attributes which match a
particular description. It takes the special attributes "attrs",
"exclude_attrs", "type", and "aggregation":
<dimension name="_remainder" exclude_attrs="attr_a,attr_b" type="annotation" weight= .../>
"attrs" and "exclude_attrs" are comma-separated sequences of
attribute names, as shown. "type" is one of the legal attribute
types ("string", "int", "annotation", "boolean", "float").
"aggregation" is one of the legal aggregations ("list", "set",
"none").
Remainder dimensions are processed after all other dimensions,
including _annotation_attribute_remainder and
_nonannotation_attribute_remainder; if both of those special
dimensions are present, all remainders are ignored. In order for
this macro dimension to apply to an attribute, the attribute must
not already be claimed by another dimension; must be listed in
"attrs", if "attrs" is specified; must not be listed in
"exclude_attrs"; and must match the "type" and "aggregation"
values, if they are specified.
If the macro dimension applies, the comparison engine defines a
dimension for that attribute, as if it had been declared
explicitly.
Remainder dimensions are processed in the order they're defined.
If an annotation label is not mentioned in a <tag_profile>
element, and it's a spanned label, it will be assigned the profile
<dimension name="_label" weight=".1" true_residue=".5"/>
<dimension name="_span" weight=".9"/>
<dimension name="_nonannotation_attribute_remainder" weight=".1"/>
<dimension name="_annotation_attribute_remainder" weight=".1"/>
If the label is not spanned, it will be assigned the profile
<dimension name="_label" weight=".2" true_residue=".5"/>
<dimension name="_nonannotation_attribute_remainder" weight=".2"/>
<dimension name="_annotation_attribute_remainder" weight=".6"/>
on the assumption that spanless annotations are typically used as
relations.
You can define as many similarity profiles as you like, naming
them via the name attribute:
<similarity_profile name="...">
...
</similarity_profile>
If you leave one of the profiles unnamed, it will be the default profile for the task; if you don't specify a similarity profile, the task default will be used (in the UI, this option will be labeled "(default)"). You can always access the system default profile by using the name "<matdefault>" (on the command line) or "(system default)" (in the UI).
The similarities among the annotations are used to feed the MAT
scorer. At the moment, annotation pairs which have similarity
scores of 1.0 are treated as matching; pairs with scores less than
1.0 are treated as clashing; and annotations which are not paired
are treated as spurious or missing, as appropriate. There is
currently no facility in the MAT scorer for treating a similarity
score less than 1.0 as a partial match.
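To illustrate how the pairing output is consumed, here's a small sketch of this tallying step (the data shapes are hypothetical):

def tally(pairs, unpaired_ref, unpaired_hyp):
    counts = {"match": 0, "clash": 0, "missing": 0, "spurious": 0}
    for _ref, _hyp, sim in pairs:
        # only a similarity of exactly 1.0 counts as a match
        counts["match" if sim == 1.0 else "clash"] += 1
    counts["missing"] = len(unpaired_ref)     # reference-only annotations
    counts["spurious"] = len(unpaired_hyp)    # hypothesis-only annotations
    return counts

print(tally([("p1", "p2", 1.0), ("o1", "o2", 0.73)], ["loc3"], []))
# {'match': 1, 'clash': 1, 'missing': 1, 'spurious': 0}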
By default, the scorer reports numbers for each true or effective
label, and for all labels aggregated together, for each document
and for all documents aggregated together. You can modify this
behavior using the <score_profile> element, and its child
elements.
The <score_profile> element is an immediate child of the
toplevel <task>
element. Here's an example:
<score_profile>
  <aggregation name="loc elements" true_labels="LOCATION,GEOPOLITICAL_ENTITY,FACILITY"/>
  <attr_decomposition attrs="nomtype" true_labels="PERSON"/>
  <label_limitation true_labels="LOCATION,GEOPOLITICAL_ENTITY,FACILITY,PERSON"/>
  <attrs_alone true_labels="PERSON"/>
</score_profile>
Remember, the "set:" shorthand is available for any value of the "true_labels" attribute in any element of the similarity or score profiles.
The <aggregation> element can be used to specify a label under
which the scores for a list of true labels will be aggregated.
This aggregation is in addition to the defaults, and sits "in
between" the individual labels and the aggregation of all labels.
In addition to aggregations, you can specify decompositions of
various labels. You can specify a decomposition by attribute (as
shown here), or by use of a method (using the
<partition_decomposition> element, not exemplified here).
For the specified labels, the scorer will report a separate score
for those annotations bearing each of the known values for that
label's attributes. So in the example we've been using on this
page, if the nomtype attribute has three values NAM, NOM and PRO,
the scorer will give you a separate subscore for PERSON
annotations where nomtype=NAM, etc.
The scorer gives you the option of
ignoring labels, using the --ignore parameter. But this parameter
doesn't simply cause the score output to ignore the annotations;
it removes them from consideration entirely, as if they aren't
present at all. This can lead to problems if, e.g., you're scoring
relations, and you only want to see the relation scores, but you
can't ignore the entity annotations, since they're a crucial
ingredient to the relation comparison. You can use the
<label_limitation> element to list just those labels you
want to see in the score output.
The attribute decompositions will show you annotation scores for
annotations which bear particular attributes. If, on the other
hand, you want to know the scores for the attributes themselves -
if, e.g., you have an attribute trainer/tagger - you can use the
<attrs_alone> element to produce these separate scores. If
an annotation's true label is specified in <attrs_alone>,
the scorer will add an "attr alone" column to the scoring output,
and for each pair of annotations, it will report the matches and
clashes, and generate scores accordingly. It will do this for any
label decomposition or aggregation the annotation is eligible for.
The attributes will be reported for the various pairing
dimensions as follows:
Note that you won't necessarily get as many total elements for
the attributes alone as you will for the labels themselves; for
annotations which are not paired (i.e., missing or spurious),
there will be no entry. Also, keep in mind that the attributes can
match when the labels don't, and in this case, the behavior is
identical to the behavior you find when annotations match in spite
of a label clash (e.g., if the label is not part of the similarity
profile): the match will be recorded under the attribute of the
reference annotation.
As with <similarity_profile>, you can define as many score
profiles as you like, naming them via the name attribute:
<score_profile name="...">
...
</score_profile>
If you leave one of the profiles unnamed, it will be the default
profile for the task; if you don't specify a score profile, the
task default will be used. You can always access the system
default profile by using the name "<default>".
The inspiration for configurable comparison is due to the scorer
used for the ACE 2005 evaluation. The inspiration for using
Kuhn-Munkres as a basis for a general-purpose comparison algorithm
is due to
Xiaoqiang Luo, "On Coreference Resolution Performance Metrics", Proceedings
of Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing (HLT/EMNLP),
pages 25-32, Vancouver, October 2005.