This document describes the output of the MATScore tool. There are several spreadsheets which can be produced: tag-level scores, token-level scores, character-level scores, "pseudo-token"-level scores, and details. By default, only the tag-level scores are produced.
The scorer uses a sophisticated pairing
algorithm to determine which annotation pairs should
generate the scores.
Throughout the scorer, we use the notion of effective label, which is described elsewhere in this documentation. Whenever possible, the scorer will use effective labels to display its scores.
The four score tables have the following columns (if your scorer
output was generated by MATExperimentEngine,
the initial columns will look a
bit different):
| Column | Description |
| --- | --- |
| similarity profile | The similarity profile used to generate the similarity scores for the annotations. |
| score profile | The score profile used to group the output scores. |
| file | The file basename of the document being scored. |
| test docs | The number of test (hypothesis) documents. This value will be the same for all rows. |
| tag | The true or effective label which is being scored in this row. The final row will be a cumulative score, with label "<all>". |
| tag subset | Optional. This column lists the particular subset of the tag instances to be scored, if such a decomposition is described in the score profile. When this column is present and no tag subset applies, the value will be "<none>". |
| attr alone | Optional. This column lists the attribute score for the specified attribute of the tag instances to be scored, if these attribute scores were requested via <attrs_alone> in the score profile. When this column is present and no attr score is being reported, the value will be "<n/a>". |
| test toks | The number of tokens in the test documents. This value will be the same for all rows. |
| match | The number of elements for this true or effective label whose pairs have a perfect similarity score. If the similarity profile ignores the label dimension, the matches will be recorded under the reference label. |
| refclash | The number of elements which bear this true or effective label in the reference document and are paired with annotations in the corresponding hypothesis document, but do not have a perfect similarity score. The scorer does not yet support reporting this value as the sum of the similarity scores rather than as a count of elements. |
| missing | The number of elements which bear this true or effective label in the reference document but are not paired with any element in the corresponding hypothesis document. |
| refonly | refclash + missing |
| reftotal | refonly + match |
| hypclash | The number of elements which bear this true or effective label in the hypothesis document and are paired with annotations in the corresponding reference document, but do not have a perfect similarity score. The scorer does not yet support reporting this value as the sum of the similarity scores rather than as a count of elements. |
| spurious | The number of elements which bear this true or effective label in the hypothesis document but are not paired with any element in the corresponding reference document. |
| hyponly | hypclash + spurious |
| hyptotal | hyponly + match |
| precision | match / hyptotal |
| recall | match / reftotal |
| fmeasure | 2 * ((precision * recall) / (precision + recall)) |
For tag-level scores, the elements counted in the match,
refclash, missing, hypclash and spurious columns are annotations;
for the other scores, the elements counted are the basic elements
for the table (tokens, pseudo-tokens, or characters).
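The derived columns are simple functions of the base counts. The following sketch is not the MATScore implementation (in particular, the behavior when a denominator is zero is our assumption), but it summarizes the relationships among the columns:

```python
def derived_scores(match, refclash, missing, hypclash, spurious):
    """Derive the remaining score columns from the base counts of one row."""
    refonly = refclash + missing        # reference elements without a perfect match
    reftotal = refonly + match          # all reference elements for this label
    hyponly = hypclash + spurious       # hypothesis elements without a perfect match
    hyptotal = hyponly + match          # all hypothesis elements for this label
    precision = match / hyptotal if hyptotal else 0.0   # zero-denominator convention assumed
    recall = match / reftotal if reftotal else 0.0      # zero-denominator convention assumed
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return {"refonly": refonly, "reftotal": reftotal, "hyponly": hyponly,
            "hyptotal": hyptotal, "precision": precision, "recall": recall,
            "fmeasure": fmeasure}

# For example, derived_scores(match=8, refclash=1, missing=1, hypclash=2, spurious=0)
# yields precision 0.8, recall 0.8 and fmeasure 0.8.
```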
When the user requests confidence data in MATScore via the --compute_confidence_data option, the scorer adds three columns (mean, variance and standard deviation) to the spreadsheet for each of the computed metrics (precision, recall, f-measure). These columns appear immediately to the right of the column for the metric.
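How MATScore collects the samples behind these statistics is not described here; purely as a reminder of what the three confidence columns report, a mean, variance and standard deviation can be computed from any collection of metric values (e.g., hypothetically, per-document precision values) along these lines:

```python
import statistics

def confidence_columns(metric_values):
    """Summarize a collection of metric values with the three statistics
    reported in the confidence columns. Whether MATScore uses population
    or sample variance, and how it draws the samples, is not settled by
    this sketch."""
    mean = statistics.mean(metric_values)
    variance = statistics.pvariance(metric_values)   # population variance assumed
    return {"mean": mean, "variance": variance, "stdev": variance ** 0.5}
```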
Each annotation pair which does not have a perfect similarity score is assigned a clash type, identified by a slug, according to the dimension in which the mismatch occurs:

| Slug | Dimension being compared | Meaning |
| --- | --- | --- |
| overmark | span | reference span is contained within the hypothesis span |
| undermark | span | hypothesis span is contained within the reference span |
| overlap | span | reference and hypothesis spans overlap |
| tagclash | label | labels do not match |
| computedtagclash | label | true labels match, but value of effective label feature does not |
| attrclash | non-aggregate attribute | attribute values clash (one or the other may be null, or both may have values) |
| attrsetclash | set-valued attribute | set values do not match (clashes among the individual set members are not reported) |
| annattrnotpaired | non-aggregate annotation-valued attribute | the reference and hypothesis values are not paired with each other according to the pairing algorithm |
| annattrclash | non-aggregate annotation-valued attribute | annotation attributes do not match |
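To make the three span-related slugs concrete, here is a small classification sketch (not MAT code; spans are half-open (start, end) character offsets, and treating identical spans as a non-clash is our assumption):

```python
def span_clash(ref_span, hyp_span):
    """Classify the span relationship between a paired reference and
    hypothesis annotation."""
    rs, re = ref_span
    hs, he = hyp_span
    if (rs, re) == (hs, he):
        return None            # identical spans: no span clash
    if hs <= rs and re <= he:
        return "overmark"      # reference span contained within the hypothesis span
    if rs <= hs and he <= re:
        return "undermark"     # hypothesis span contained within the reference span
    return "overlap"           # the paired spans overlap without containment

# e.g. span_clash((11, 40), (3, 32)) -> "overlap"
```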
The token, character and pseudo-token tables each use a different basic element for their element counts. Because these elements are fixed across the reference and hypothesis documents, there are no span clashes in these score tables. The "test toks" column will be labeled "test chars" and "test pseudo-toks" in the last two spreadsheets.
The fixed-span score tables have some additions to the core
column set. The additional columns are:
| Column | Description |
| --- | --- |
| tag_sensitive_accuracy | (test toks - refclash - missing - spurious) / test toks (essentially, the fraction of tokens in the reference which were tagged correctly, including those which were not tagged at all) |
| tag_sensitive_error_rate | 1 - tag_sensitive_accuracy |
| tag_blind_accuracy | (test toks - missing - spurious) / test toks (essentially, the fraction of tokens in the reference which were properly assigned a tag - any tag) |
| tag_blind_error_rate | 1 - tag_blind_accuracy |
If the user requests confidence data, it will be reported for all four of these additional columns.
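These four columns follow directly from counts already present in the row; a minimal sketch (not the MATScore implementation) of the arithmetic:

```python
def fixed_span_accuracies(test_toks, refclash, missing, spurious):
    """Accuracy and error-rate columns for the token, character and
    pseudo-token score tables."""
    tag_sensitive_accuracy = (test_toks - refclash - missing - spurious) / test_toks
    tag_blind_accuracy = (test_toks - missing - spurious) / test_toks
    return {
        "tag_sensitive_accuracy": tag_sensitive_accuracy,
        "tag_sensitive_error_rate": 1 - tag_sensitive_accuracy,
        "tag_blind_accuracy": tag_blind_accuracy,
        "tag_blind_error_rate": 1 - tag_blind_accuracy,
    }
```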
The token-level score elements are generated by whatever
tokenizer was used to tokenize the scored documents. If no
tokenizer was used, you won't be able to produce token-level
scores. Character scores, on the other hand, are always available,
because the characters themselves serve as the basic elements.
We've included character-level scoring to provide sub-tag-level
granularity in situations where tokenization hasn't been performed
or isn't available for some reason (although nothing stops you
from using these methods alongside token-level scores). In
addition, in the interest of producing something more "token-like"
in the absence of actual tokenization, we've designed a notion of
"pseudo-token". To compute the pseudo-tokens for a document, we
collect the set of start and end indices for the content
annotations in both the reference and hypothesis documents, order
the indices, and count the whitespace-delimited tokens in each
span, including the edge spans of the document. This count will
be, at minimum, the number of whitespace-delimited tokens in the
document as a whole, but may be greater, if annotation boundaries
don't abut whitespace.
For example, consider this deeply artificial example:
ref: the future <NP>President of the United State</NP>s
hyp: the<NP> future President of the Unit</NP>ed States
The pseudo-tokens in this example are computed as follows. The annotation boundaries fall at character offsets 11 and 40 in the reference and 3 and 32 in the hypothesis; together with the document edges, these offsets divide the text into the regions "the", " future ", "President of the Unit", "ed State" and "s", which contain 1, 1, 4, 2 and 1 whitespace-delimited tokens respectively. The document therefore contains 9 pseudo-tokens, whereas plain whitespace tokenization yields only 7 tokens.
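As an illustration only (this is not the MAT implementation; the function name and the explicit offset list are ours), the computation above can be sketched in a few lines of Python:

```python
def pseudo_token_count(text, boundaries):
    """Count pseudo-tokens: the whitespace-delimited tokens inside each
    region delimited by the annotation boundaries of both documents,
    including the regions at the document edges."""
    cuts = sorted(set(boundaries) | {0, len(text)})              # add the document edges
    regions = [text[start:end] for start, end in zip(cuts, cuts[1:])]
    return sum(len(region.split()) for region in regions)

text = "the future President of the United States"
# reference <NP> spans offsets 11-40 ("President of the United State");
# hypothesis <NP> spans offsets 3-32 (" future President of the Unit")
print(pseudo_token_count(text, [11, 40, 3, 32]))   # -> 9 pseudo-tokens
print(len(text.split()))                           # -> 7 whitespace tokens
```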
The granularity of pseudo-tokens is hopefully more informative
than character granularity for those languages which are
substantially whitespace-delimited, without having to make any
complex, and perhaps irrelevant, decisions about tokenization.
Using both the whitespace boundaries and the annotation boundaries
as region delimiters allows us to deal with the minimum level of
granularity that the pair of documents in question requires to
account for all the annotation contrasts. We recognize that this
is a novel approach, but we hope it will be useful.
Note: unlike token and
character scores, the number of pseudo-tokens is a function of the
overlaps between the reference and hypothesis. Therefore, the
actual number of pseudo-tokens in the document will vary slightly
depending on the performance and properties of your tagger. Do not be alarmed by this.
The detail spreadsheet is intended to provide a span-by-span
assessment of the scoring inputs.
| Column | Description |
| --- | --- |
| file | the name of the hypothesis from which the entry is drawn |
| type | one of missing, spurious, match (the meaning of these values should be clear from the preceding discussion), or one of the error causes described above |
| refid | the ID of the annotation in the reference document, if the ID exists (used for cross-referencing with annotation attribute values) |
| hypid | the ID of the annotation in the hypothesis document, if the ID exists (used for cross-referencing with annotation attribute values) |
| refdescription | the description of the annotation in the reference document |
| hypdescription | the description of the annotation in the hypothesis document |
| reflabel | the label on the annotation in the reference document |
| refstart | the start index, in characters, of the annotation in the reference document, if spanned |
| refend | the end index, in characters, of the annotation in the reference document, if spanned |
| hyplabel | the label on the annotation in the hypothesis document |
| hypstart | the start index, in characters, of the annotation in the hypothesis document, if spanned |
| hypend | the end index, in characters, of the annotation in the hypothesis document, if spanned |
| refcontent | the text between the start and end indices of the annotation in the reference document, if spanned |
| hypcontent | the text between the start and end indices of the annotation in the hypothesis document, if spanned |
In addition to these columns, if the task contains an
explicitly-defined similarity profile (i.e., not the default)
which specifies dimensions other than the label and span, the
mismatch type associated with each dimension will be listed, one
per column, immediately after the "hypend" column.
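If you save the detail spreadsheet as CSV (an assumption about your output settings; the file name below is hypothetical), tallying the entries by type is a convenient first pass over the output:

```python
import csv
from collections import Counter

def tally_detail_types(path):
    """Count detail entries by the 'type' column: match, missing, spurious,
    or one of the clash slugs listed earlier in this document."""
    with open(path, newline="") as f:
        return Counter(row["type"] for row in csv.DictReader(f))

# e.g. tally_detail_types("details.csv") might yield something like
# Counter({'match': 120, 'spurious': 7, 'overlap': 5, 'missing': 3})
```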