Scoring and comparison

MAT uses a common algorithm for comparing two documents to each other, and for producing common scoring statistics based on that comparison. This algorithm underlies the comparison view, the reconciliation process, and MATScore.

MAT's approach to scoring and comparison is flexible and highly configurable. It doesn't cover all possible cases of annotations you might wish to compare, but it covers a large number of them.

When you need to understand scoring and comparison, and when you don't

In most situations, you don't need to know much about scoring and comparison. MAT is configured with reasonably intelligent defaults for comparing annotations to each other, including annotations with annotation-valued attributes. If you want to change these defaults, or further enhance the aggregation or decomposition of scores in the score spreadsheet, or if you want multiple options for scoring or similarity, then you should read this document.

One thing to keep in mind is that MAT cannot compare every possible annotation set. In particular, because of the way MAT is built, it can't compare annotation sets which contain circular references to annotation types. Here are a couple examples:

  <annotations>
    <span label="PERSON"/>
    <span label="LOCATION"/>
    <span label="ORGANIZATION"/>
    <span label="RELATION">
      <filler name="arg" filler_types="RELATION"/>
    </span>
  </annotations>
  <annotations>
    <span label="PERSON"/>
    <span label="LOCATION">
      <filler name="arg" filler_types="RELATION"/>
    </span>
    <span label="ORGANIZATION"/>
    <span label="RELATION">
      <filler name="arg" filler_types="LOCATION"/>
    </span>
  </annotations>

In the first example, there's a direct circular reference; RELATION has an annotation-valued attribute which is restricted to RELATION. In the second case, the circular reference is indirect. At the moment, the comparison and scoring algorithm cannot handle these cases. For more details, see below.

The comparison algorithm

The MAT comparison algorithm uses a set of configurable similarity profiles to create similarity metrics for the Kuhn-Munkres bipartite matching algorithm, executes that algorithm over relevant subsets of the annotations, and does so in a stratified manner that ensures that annotations are compared in an appropriate order. Here's how it works.

Similarity profiles

The MAT comparison algorithm begins with a set of similarity profiles. These are declarative descriptions, for particular labels, of how those labels should be compared: which dimensions of the annotation to use, how each dimension should be compared, and how the similarity of each dimension contributes to the similarity of the whole.

The <similarity_profile> element is an immediate child of the toplevel <task> element. Here's an example:

<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>

So consider a PERSON annotation over characters 10 - 20 with nomtype = PRO in document 1, vs. an ORGANIZATION annotation over characters 11 - 20 with nomtype = NAM in document 2. The labels don't match (so the _label dimension contributes 0 of a possible 2); the nomtype values don't match (so the nomtype dimension contributes 0 of a possible 1); and the spans overlap in 9 of a possible 10 characters (so the _span dimension contributes the full 8 of a possible 8, since the overlap is .9 and .8 is the lower bound for what counts as a complete match). So the similarity of these two annotations, given this profile, is ((0 * 2) + (1.0 * 8) + (0 * 1)) / (2 + 8 + 1) = 8/11, or about .73.

As you can see, you can specify special dimensions, like _span or _label, as well as attributes of the annotations listed in the true_labels. The defaults, and the currently available methods and dimensions, are described below.
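To make the arithmetic concrete, here's a minimal Python sketch of the weighted-sum computation (an illustration of the formula only, not MAT's actual code; the dimension methods here are stand-ins for the real ones):

def similarity(dims, ann1, ann2):
    # dims: list of (weight, method) pairs; each method returns a
    # value between 0 and 1 for the two annotations being compared.
    total = sum(w * method(ann1, ann2) for w, method in dims)
    return total / float(sum(w for w, _ in dims))

# The PERSON/ORGANIZATION example above:
dims = [(2, lambda a, b: 0.0),   # _label: PERSON vs. ORGANIZATION
        (8, lambda a, b: 1.0),   # _span: overlap .9 is above the .8 bound
        (1, lambda a, b: 0.0)]   # nomtype: PRO vs. NAM
print(similarity(dims, None, None))   # 8/11, about .73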

Labels and annotation sets

In various locations in the similarity profiles (and score profiles, too), you have the opportunity to specify a list of true labels, as in the <tag_profile> element shown above. Wherever you can specify true labels, you can also specify the name of an annotation set, using the prefix "set:". E.g., in the example above, if the annotation set descriptor named "content" consisted entirely of the PERSON and ORGANIZATION labels, you could also write this similarity profile as follows:

<similarity_profile>
  <tag_profile true_labels="set:content">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>

This shorthand is available for any value of the "true_labels" attribute in any element of the similarity or score profiles. It refers exclusively to the labels mentioned in the referenced sets; if the referenced annotation set descriptor defines an attribute which refers to an annotation label defined in another annotation set descriptor, that label will not be included in the expanded set of true labels.

The Kuhn-Munkres algorithm

Once we've established the similarities, we invoke the Kuhn-Munkres bipartite matching algorithm (otherwise known as the Hungarian algorithm). Kuhn-Munkres finds the best matching between two sets, given similarity measures among the set elements. The similarity profiles provide these measures. Once we run Kuhn-Munkres, we have a pairing between the annotations in the two comparison sets, along with a similarity measure for each best pair.
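If you want to experiment with the pairing step outside MAT, the same computation is available off the shelf; e.g., scipy's linear_sum_assignment solves the identical problem (this sketch is an illustration, not how MAT invokes the algorithm internally):

import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i][j] = similarity of annotation i in doc 1 to annotation j in doc 2.
sim = np.array([[0.73, 0.10],
                [0.20, 1.00],
                [0.00, 0.40]])
rows, cols = linear_sum_assignment(sim, maximize=True)
for i, j in zip(rows, cols):
    print("doc1[%d] <-> doc2[%d], similarity %.2f" % (i, j, sim[i, j]))

Here the third annotation in doc 1 is left unpaired, and would be scored as spurious or missing.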

Segmenting Kuhn-Munkres

When we run Kuhn-Munkres, we don't need to compare all the annotations in one document with all the annotations in the other. For instance, for spanned annotations, we stipulate that if two annotations don't overlap at all, they can't form a pair. So we can subdivide the spanned annotations into candidate overlap sets, and run Kuhn-Munkres over many small sets instead of one big one. In addition, if a region of text has annotations in one document but none in the other, none of those annotations can be paired, and we don't need to run Kuhn-Munkres at all for that subset of annotations.
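Here's a sketch of the segmentation idea, assuming each annotation is represented by a (start, end, annotation) triple (the representation is hypothetical; only the grouping logic matters):

def overlap_clusters(spans):
    # spans: (start, end, annotation) triples drawn from both documents.
    # Returns groups of annotations connected by chains of overlap;
    # Kuhn-Munkres need only run within each group, and a group whose
    # annotations all come from one document can be skipped entirely.
    clusters, current, current_end = [], [], None
    for start, end, ann in sorted(spans, key=lambda s: (s[0], s[1])):
        if current and start >= current_end:
            clusters.append(current)
            current, current_end = [], None
        current.append(ann)
        current_end = end if current_end is None else max(current_end, end)
    if current:
        clusters.append(current)
    return clusters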

If we're trying to compare spanless annotations, we have two options. We first attempt to compute an "implied span", which reaches from the start index of the earliest annotation that this annotation "points to" (i.e., has as a value of an annotation-valued attribute), to the end index of the latest annotation that this annotation "points to". This computation is performed recursively. We subdivide all spanless annotations for which we can compute an implied span in the same way we subdivide spanned annotations; all other spanless annotations are subdivided by label.
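A sketch of the implied span computation (the accessors here are hypothetical):

def implied_span(ann):
    # ann.span is a (start, end) pair, or None for a spanless annotation;
    # ann.pointed_to lists the annotation values of its attributes.
    # The recursion terminates because cyclic schemas are excluded.
    spans = [ann.span] if ann.span is not None else []
    for child in ann.pointed_to:
        child_span = implied_span(child)
        if child_span is not None:
            spans.append(child_span)
    if not spans:
        return None   # no implied span; this annotation is grouped by label
    return (min(s for s, e in spans), max(e for s, e in spans))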

Stratifying Kuhn-Munkres

The presence of annotations which have annotation-valued attributes (e.g., relations and events) poses a problem for the simple case. Such annotations imply that what we should be doing is finding a best graph alignment, rather than a best set alignment. We've chosen not to deal with graph alignment. Instead, we've chosen to require that the similarity algorithm be stratified: all annotations which are "pointed to" must be paired before the annotations that "point to" them are paired. The output of the earlier stratum is used as input to the later stratum; by default, annotation-valued attributes cannot match if their values were not paired on a previous stratum.

This stratification is enforced on a label-by-label basis, not an annotation-by-annotation basis. The analysis is static: if an annotation label is listed in the label restriction of an annotation-valued attribute in your task, it has to be paired before the annotation that bears the annotation-valued attribute. E.g. in:

<span label="PERSON">
<span label="EVENT">
<filler name="arg1" filler_types="PERSON"/>
</span>

the PERSON annotations must be guaranteed to be paired before the EVENT annotations.

This guarantee can be imposed in one of two ways: the strata can be computed for you, or you can declare them explicitly. In either case, each stratum is internally processed so that spanned annotations are paired first, and by default, all annotations are in a single stratum; so if your task contains some spanned and some spanless annotations, the spanned annotations will be paired before the spanless annotations in each stratum. If you provide no strata, they will be computed for you; alternatively, you can define strata explicitly:

<similarity_profile>
  <stratum true_labels="PERSON"/>
  <stratum true_labels="EVENT"/>
  ...
</similarity_profile>

The order of the <stratum> elements determines the stratum order.
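When the strata are computed for you, the computation is in effect a topological sort of the labels by their "points-to" restrictions, and a cycle among the labels is exactly what makes the computation (and hence the comparison) impossible. A minimal sketch, not MAT's code:

def compute_strata(points_to):
    # points_to: dict mapping each label to the set of labels its
    # annotation-valued attributes are restricted to. Every label,
    # including labels that are only pointed to, must appear as a key.
    strata, placed = [], set()
    while len(placed) < len(points_to):
        # A label is ready once everything it points to is already placed.
        ready = {lbl for lbl, targets in points_to.items()
                 if lbl not in placed and targets <= placed}
        if not ready:
            raise ValueError("circular label reference; cannot stratify")
        strata.append(sorted(ready))
        placed |= ready
    return strata

print(compute_strata({"PERSON": set(), "EVENT": {"PERSON"}}))
# [['PERSON'], ['EVENT']]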

The stratification requirement allows us to continue to use a set alignment algorithm, rather than a graph alignment algorithm, but also leads to a couple of restrictions. First, the overall annotation pairing isn't a true global best pair, because we can't use the annotations that "point to" a given annotation as a candidate comparison dimension. Second, as we noted in the introduction, annotation descriptions which have cycles can't be compared:

<span label="PERSON">
<span label="EVENT">
<filler name="arg1" filler_types="PERSON EVENT"/>
</span>

There's nothing incoherent about such an annotation schema; it just can't be compared or scored using the algorithm in MAT.

Note: the Kuhn-Munkres stratification is completely independent of the partition into annotation sets. This is intentional; we don't believe that these things should be coupled to each other.

Defining similarity profiles

Each similarity profile consists of zero or more <stratum> declarations, and zero or more <tag_profile> declarations. The labels referenced in the similarity profiles are true, not effective, labels.

Strata

The <stratum> declarations are present to enforce the stratification requirements of the comparison algorithm. If your profile contains no <stratum> declarations, the comparison algorithm will attempt to create a stratum order which respects the stratification requirements. If it isn't possible to do this, the comparison algorithm will raise an error. If you declare at least one <stratum>, and your strata do not mention all your content labels, the comparison algorithm may also raise an error. So if you have to declare strata, be sure to mention all your content labels. You define a stratum like this:

<similarity_profile>
  <stratum true_labels="label1,label2,..."/>
  ...
</similarity_profile>

Tag profiles

Each <tag_profile> element defines the way a set of true labels are compared to each other. All the labels listed must have the dimensions specified; if the dimensions are annotation attributes, the attribute must have the same type and aggregation in each annotation. Here's our example again:

<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>

Here, both PERSON and ORGANIZATION must be spanned annotations (since the _span dimension is present) and both must have a "nomtype" attribute, and the type and aggregation of that attribute must be the same in each (e.g., it can't be a string for PERSON and a set of integers for ORGANIZATION).

If the annotations have attributes you want to compare, but the attributes have different names, you can use the <attr_equivalences> element to address that:

<similarity_profile>
  <tag_profile true_labels="PERSON,ORGANIZATION">
    <attr_equivalences name="nomtype" equivalences="p_nomtype,o_nomtype"/>
    <dimension name="_label" weight="2"/>
    <dimension name="_span" method="overlap" weight="8" overlap_match_lower_bound=".8"/>
    <dimension name="nomtype" weight="1"/>
  </tag_profile>
</similarity_profile>

So in this example, if PERSON has the p_nomtype attribute, and ORGANIZATION has the o_nomtype attribute, you can treat them as the same attribute for the purposes of comparison. Note that if you use an attribute equivalence, the dimensions must refer to the name of the equivalence.

If the algorithm has to compare two annotations which are not in the same <tag_profile>, it will use only the non-attribute dimensions for comparison, and use the weight of the attribute dimensions as "dead weight" (i.e., as if the comparison score for those dimensions were 0). It will compare annotation A to annotation B using annotation A's profile, then compare them using annotation B's profile, and take the smaller of the two results. In other words, the comparison scores of annotations which are not in the same <tag_profile> are strongly penalized.

Dimensions

Each dimension must have a name and a weight. You can also specify a method; every dimension has a default method, and a default means of applying the specified method to aggregate types (lists and sets). At the moment, only set aggregates have an implemented default comparison method (we haven't had much use for the list types yet). If you don't specify a method, the default method will be used. Some methods have optional parameters which can also be specified in the <dimension> element, as shown in the example above for _span. Each method returns a value between 0 and 1, and this value is multiplied by the weight to determine the total contribution this dimension makes to the similarity.

It's possible to define your own dimension methods, but describing this process is beyond the scope of this documentation.

The available dimensions are:

"_label"

This dimension compares the two annotation labels. If the annotation has an effective label, that label will be used; if the annotation label is mentioned in an equivalence class when the scorer is invoked (see the --equivalence_class option of MATScore), the equivalence class name will be used. The default _label method is "label_equality". This method accepts an optional "true_residue" parameter, a float between 0 and 1; if the effective label values don't match, but the underlying true labels do, this value will be returned as the similarity. Examples:

<dimension name="_label" weight="2"/>
<dimension name="_label" method="label_equality" weight="2"/>
<dimension name="_label" weight="2" true_residue=".5"/>

"_span"

This dimension compares the two spans. It can only be used with spanned annotations; it cannot be used to compare the "implied span" used when segmenting the Kuhn-Munkres algorithm. The default _span method is "overlap". This method returns the fraction of the combined span of the two annotations in which they overlap; so if one annotation covers characters 10 - 15, and another annotation covers characters 12 - 20, the combined span is 10 - 20, and the overlap is 12 - 15, so the method returns .3.

In general, you'll probably prefer the default behavior, which makes the span similarity proportional to the overlap. However, if you wish to apply a threshold to the similarity, this method accepts two optional parameters: "overlap_match_lower_bound", which specifies an overlap value above which 1.0 will be returned, and "overlap_mismatch_upper_bound", which specifies an overlap value below which 0 will be returned. Both of these endpoints are non-inclusive. Examples:

<dimension name="_span" method="overlap" weight="4"/>
<dimension name="_span" weight="8" overlap_match_lower_bound=".8"/>

attributes and attribute equivalences

Every annotation attribute is a dimension. If a dimension name isn't one of the special cases listed here, it's interpreted as an attribute (or as an attribute equivalence, if such an equivalence is declared). If one of your attributes has the same name as one of the special cases documented here, you won't be able to refer to it in your similarity description.

For attributes of type other than "annotation", the default comparison method is called "equality", and simply returns 1 if the two values are equal, 0 otherwise. If the attribute is a set aggregation, the value is the ratio of the size of the intersection of the set values over the size of the union of the set values.

For attributes of type "annotation", the default comparison method is called "similarity", and it returns 0 if the annotation values were not paired in a previous stratum, and the computed similarity value otherwise. If the attribute is a set aggregation, the Kuhn-Munkres algorithm is invoked to compare the sets, using the "similarity" comparison method to provide the similarity weights, and the overall similarity is the sum of the similarities found by the Kuhn-Munkres algorithm, divided by the maximum possible similarity.

Example:

<dimension name="nomtype" weight="1"/>

multiple attributes

If a dimension name contains a comma, it's treated as a set of attributes to be considered together. Such a dimension must have a method specified (and there will be no aggregate handler for that method). At the moment, the only known method for this case is "_annotation_set_similarity". This method is intended to be used in the case where there are two or more annotation-valued attributes which are interchangeable, as in the case of a symmetric relation like "related-to". The implementation of this method is essentially identical to the method for comparing annotation-valued attributes which are set aggregations. Example:

<dimension name="arg1,arg2" weight="2" method="_annotation_set_similarity"/>

"_nonannotation_attribute_remainder"

This special dimension is used primarily in the default profiles specified below. It looks at all the attributes that are not of type "annotation" and are not otherwise declared as dimensions in the profile, and compares them. If there are no such attributes, this dimension is ignored; i.e., its weight will be treated as zero. Otherwise, the similarity is the sum of the similarities of these otherwise-undeclared attributes divided by the number of attributes.

"_annotation_attribute_remainder"

This special dimension is used primarily in the default profiles specified below. It looks at all the attributes that are of type "annotation" and are not otherwise declared as dimensions in the profile, collapses all the values for these attributes into a single set for each annotation (including the members of any set aggregations), and compares those sets using Kuhn-Munkres. The idea here is that when comparing relations in a default manner, the actual names of the arguments aren't nearly as important as they are in the nonannotation case, and should be ignored.

The weight for this dimension is based on how many of the named relevant attributes in the two annotations could be paired. So, for example, if the names of the relevant attributes are identical in the two annotations being paired, it's possible for this dimension to match perfectly. If, on the other hand, some of the attribute names do not match, the maximum score for the dimension will be limited by the number of non-matching attribute names. The intention here is that this dimension will be able to yield perfect matches for annotations which are clearly intended to match (i.e., they have the same label), and will still increase the similarity of annotations which can't match perfectly, if some of the annotation-valued attributes overlap in their values.

As in the _nonannotation_attribute_remainder case above, if there are no attributes which meet the requirements, this dimension's weight will be treated as zero.

"_remainder"

This special dimension is a "macro" dimension - it stands for all remaining unclaimed common attributes which match a particular description. It takes the special attributes "attrs", "exclude_attrs", "type", and "aggregation":

<dimension name="_remainder" exclude_attrs="attr_a,attr_b" type="annotation" weight="..."/>

"attrs" and "exclude_attrs" are comma-separated sequences of attribute names, as shown. "type" is one of the legal attribute types ("string", "int", "annotation", "boolean", "float"). "aggregation" is one of the legal aggregations ("list", "set", "none").

Remainder dimensions are processed after all other dimensions, including _annotation_attribute_remainder and _nonannotation_attribute_remainder; if both of those special dimensions are present, all remainders are ignored. In order for this macro dimension to apply to an attribute, the attribute must not already be claimed by another dimension; it must be listed in "attrs", if "attrs" is specified; it must not be listed in "exclude_attrs"; and it must match the "type" and "aggregation" values, if they are specified.

If the macro dimension applies, the comparison engine defines a dimension for that attribute, as if it had been declared explicitly.

Remainder dimensions are processed in the order they're defined.
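The applicability test can be pictured as follows (a sketch of the conditions just listed; the attribute object and its fields are hypothetical, and the parameter names mirror the XML attributes even where they shadow Python builtins):

def remainder_applies(attr, claimed, attrs=None, exclude_attrs=None,
                      type=None, aggregation=None):
    # attr has .name, .type and .aggregation; claimed holds the names
    # already used by explicitly declared dimensions.
    if attr.name in claimed:
        return False
    if attrs is not None and attr.name not in attrs:
        return False
    if exclude_attrs is not None and attr.name in exclude_attrs:
        return False
    if type is not None and attr.type != type:
        return False
    if aggregation is not None and attr.aggregation != aggregation:
        return False
    return True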

Default profiles

If an annotation label is not mentioned in a <tag_profile> element, and it's a spanned label, it will be assigned the profile

<dimension name="_label" weight=".1" true_residue=".5"/>
<dimension name="_span" weight=".9"/>
<dimension name="_nonannotation_attribute_remainder" weight=".1"/>
<dimension name="_annotation_attribute_remainder" weight=".1"/>

If the label is not spanned, it will be assigned the profile

<dimension name="_label" weight=".2" true_residue=".5"/>
<dimension name="_nonannotation_attribute_remainder" weight=".2"/>
<dimension name="_annotation_attribute_remainder" weight=".6"/>

on the assumption that spanless annotations are typically used as relations.

Multiple profiles

You can define as many similarity profiles as you like, naming them via the name attribute:

<similarity_profile name="...">
...
</similarity_profile>

If you leave one of the profiles unnamed, it will be the default profile for the task; if you don't specify a similarity profile, the task default will be used (in the UI, this option will be labeled "(default)"). You can always access the system default profile by using the name "<matdefault>" (on the command line) or "(system default)" (in the UI).

Scoring profiles and the scorer

The similarities among the annotations are used to feed the MAT scorer. At the moment, annotation pairs which have similarity scores of 1.0 are treated as matching; pairs with scores less than 1.0 are treated as clashing; and annotations which are not paired are treated as spurious or missing, as appropriate. There is currently no facility in the MAT scorer for treating a similarity score less than 1.0 as a partial match.

By default, the scorer reports numbers for each true or effective label, and for all labels aggregated together, for each document and for all documents aggregated together. You can modify this behavior using the <score_profile> element, and its child elements.

The <score_profile> element is an immediate child of the toplevel <task> element. Here's an example:

<score_profile>
  <aggregation name="loc elements" true_labels="LOCATION,GEOPOLITICAL_ENTITY,FACILITY"/>
  <attr_decomposition attrs="nomtype" true_labels="PERSON"/>
  <label_limitation true_labels="LOCATION,GEOPOLITICAL_ENTITY,FACILITY,PERSON"/>
  <attrs_alone true_labels="PERSON"/>
</score_profile>

Remember, the "set:" shorthand is available for any value of the "true_labels" attribute in any element of the similarity or score profiles.

Aggregations

The <aggregation> element can be used to specify a label under which the scores for a list of true labels will be aggregated. This aggregation is in addition to the defaults: it is specified by a list of true labels, and sits "in between" the individual labels and the aggregation of all labels.

Decompositions

In addition to aggregations, you can specify decompositions of various labels. You can specify a decomposition by attribute (as shown here), or by use of a method (using the <partition_decomposition> element, not exemplified here). For the specified labels, the scorer will report a separate score for the annotations bearing each known value of the listed attributes. So in the example we've been using on this page, if the nomtype attribute has three values NAM, NOM and PRO, the scorer will give you a separate subscore for PERSON annotations where nomtype=NAM, etc.

Label limitations

The scorer gives you the option of ignoring labels, using the --ignore parameter. But this parameter doesn't simply cause the score output to ignore the annotations; it removes them from consideration entirely, as if they aren't present at all. This can lead to problems if, e.g., you're scoring relations, and you only want to see the relation scores, but you can't ignore the entity annotations, since they're a crucial ingredient to the relation comparison. You can use the <label_limitation> element to list just those labels you want to see in the score output.

Attributes alone

The attribute decompositions will show you annotation scores for annotations which bear particular attributes. If, on the other hand, you want to know the scores for the attributes themselves - if, e.g., you have an attribute trainer/tagger - you can use the <attrs_alone> element to produce these separate scores. If an annotation's true label is specified in <attrs_alone>, the scorer will add an "attr alone" column to the scoring output, and for each pair of annotations, it will report the matches and clashes, and generate scores accordingly. It will do this for any label decomposition or aggregation the annotation is eligible for.

The attributes are reported along the same pairing dimensions as the labels: if the two annotations are paired and the attribute values are equal, the attribute is recorded as a match; if they're paired and the values differ, it's recorded as a clash.

Note that you won't necessarily get as many total elements for the attributes alone as you will for the labels themselves; for annotations which are not paired (i.e., missing or spurious), there will be no entry. Also, keep in mind that the attributes can match when the labels don't, and in this case, the behavior is identical to the behavior you find when annotations match in spite of a label clash (e.g., if the label is not part of the similarity profile): the match will be recorded under the attribute of the reference annotation.

Multiple profiles

As with <similarity_profile>, you can define as many score profiles as you like, naming them via the name attribute:

<score_profile name="...">
...
</score_profile>

If you leave one of the profiles unnamed, it will be the default profile for the task; if you don't specify a score profile, the task default will be used. You can always access the system default profile by using the name "<default>".

Acknowledgements

The inspiration for configurable comparison is due to the scorer used for the ACE 2005 evaluation. The inspiration for using Kuhn-Munkres as a basis for a general-purpose comparison algorithm is due to

Xiaoqiang Luo, "On Coreference Resolution Performance Metrics", Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 25-32, Vancouver, October 2005.