Score output

This document describes the output of the MATScore tool. There are several spreadsheets which can be produced: tag-level scores, token-level scores, character-level scores, "pseudo-token"-level scores, and details. By default, only the tag-level scores are produced.

Annotation pairing and label matching

The scorer uses a sophisticated pairing algorithm to determine which annotation pairs should generate the scores.

Throughout the scorer, we use the notion of an effective label, which is described elsewhere in this documentation. Whenever possible, the scorer will use effective labels to display its scores.

All score tables

The four score tables have the following columns (if your scorer output was generated by MATExperimentEngine, the initial columns will look a bit different):

similarity profile
The similarity profile used to generate the similarity scores for the annotations.
score profile
The score profile used to group the output scores.
file
The file basename of the document being scored.
test docs
The number of test (hypothesis) documents. This value will be the same for all rows.
tag
The true or effective label which is being scored in this row. The final row will be a cumulative score, with label "<all>".
tag subset
Optional. This column lists the particular subset of the tag instances to be scored, if such a decomposition is described in the score profile. When this column is present and no tag subset applies, the value will be "<none>".
attr alone
Optional. This column lists the attribute score for the specified attribute of the tag instances to be scored, if these attribute scores were requested via <attrs_alone> in the score profile. When this column is present and no attr score is being reported, the value will be "<n/a>".
test toks
The number of tokens in the test documents. This value will be the same for all rows.
match
The number of elements for this true or effective label whose pairs have a perfect similarity score. If the similarity profile ignores the label dimension, the matches will be recorded under the reference label.
refclash
The number of elements bearing this true or effective label in the reference document which are paired with annotations in the corresponding hypothesis document but do not have a perfect similarity score. The scorer does not yet provide a way to report this value as the sum of the similarity scores rather than as a count of elements.
missing
The number of elements which bear this true or effective label in the reference document but are not paired with any element in the corresponding hypothesis document.
refonly
refclash + missing
reftotal
refonly + match
hypclash
The number of elements bearing this true or effective label in the hypothesis document which are paired with annotations in the corresponding reference document but do not have a perfect similarity score. The scorer does not yet provide a way to report this value as the sum of the similarity scores rather than as a count of elements.
spurious
The number of elements which bear this true or effective label in the hypothesis document but are not paired with any element in the corresponding reference document.
hyponly
hypclash + spurious
hyptotal
hyponly + match
precision
match / hyptotal
recall
match / reftotal
fmeasure
2 * ((precision * recall) / (precision + recall))

For tag-level scores, the elements counted in the match, refclash, missing, hypclash and spurious columns are annotations; for the other scores, the elements counted are the basic elements for the table (tokens, pseudo-tokens, or characters).
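
To make the arithmetic concrete, here is a minimal sketch of how the derived columns follow from the five base counts (the function and its name are illustrative, not part of the MATScore API):

    def derived_columns(match, refclash, missing, hypclash, spurious):
        # Derived counts, per the definitions above.
        refonly = refclash + missing
        reftotal = refonly + match
        hyponly = hypclash + spurious
        hyptotal = hyponly + match
        # Metrics; guard against empty denominators.
        precision = match / hyptotal if hyptotal else 0.0
        recall = match / reftotal if reftotal else 0.0
        denom = precision + recall
        fmeasure = 2 * (precision * recall) / denom if denom else 0.0
        return {"refonly": refonly, "reftotal": reftotal,
                "hyponly": hyponly, "hyptotal": hyptotal,
                "precision": precision, "recall": recall,
                "fmeasure": fmeasure}

    # 42 matches, 3 + 5 reference-side errors, 3 + 2 hypothesis-side errors:
    # precision = 42/47 ~ 0.894, recall = 42/50 = 0.84, fmeasure ~ 0.866
    print(derived_columns(42, 3, 5, 3, 2))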

Confidence data

When the user requests confidence data in MATScore via the --compute_confidence_data option, the scorer adds three columns (mean, variance and standard deviation) to the spreadsheet for each of the computed metrics (precision, recall, f-measure). These columns appear immediately to the right of the column for the metric.
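
The three statistics are the familiar ones, computed over whatever samples the scorer draws; for instance (illustrative values only, and whether the scorer uses sample or population variance is not specified here):

    import statistics

    # Hypothetical precision values from repeated scoring runs.
    precision_samples = [0.88, 0.91, 0.86, 0.90, 0.89]
    print(statistics.mean(precision_samples))      # mean: 0.888
    print(statistics.variance(precision_samples))  # (sample) variance
    print(statistics.stdev(precision_samples))     # standard deviation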

Optional error detail columns in the tag spreadsheet

When the pairing algorithm produces a similarity score for a pair of annotations, and that pair is not perfect (the score falls between 0 and 1.0), the pairing algorithm optionally records a "slug" for the cause of the mismatch. If you specify the --tag_output_mismatch_details flag in MATScore, the accumulated causes will be tabulated and reported in the tag spreadsheet. These cause slugs vary by similarity dimension: overmark and undermark for spans, tagclash for labels, setclash (set mismatch) for set-valued attributes, etc. A pairing may contribute multiple causes: if a pair exhibits both a span and a label mismatch, both will be recorded. As a result, the sum of the causes will not necessarily correspond to the counts in the refclash or hypclash columns.

If these causes are displayed, a column for each cause will appear immediately after the refclash and hypclash columns in the tag spreadsheet. For each <slug>, the reference column will be named "ref<slug> (detail)" and the hypothesis column will be named "hyp<slug> (detail)".
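
For example, a sketch of how the detail column names are derived from the cause slugs (the slug list here is illustrative, not exhaustive):

    # Detail column names follow the "ref<slug> (detail)" /
    # "hyp<slug> (detail)" pattern described above.
    for slug in ("overmark", "undermark", "tagclash", "setclash"):
        print("ref%s (detail)" % slug, "/", "hyp%s (detail)" % slug)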

Fixed-span scores (token, character, pseudo-token)

The token, character and pseudo-token tables each use a different basic element for their counts. Because these elements are fixed across the reference and hypothesis documents, there are no span clashes in these score tables. The "test toks" column is labeled "test chars" in the character spreadsheet and "test pseudo-toks" in the pseudo-token spreadsheet.

The fixed-span score tables have some additions to the core column set. The additional columns are:

tag_sensitive_accuracy
(test toks - refclash - missing - spurious)/test toks (essentially, the fraction of tokens in the reference which were tagged correctly, including those which were not tagged at all)
tag_sensitive_error_rate
1 - tag_sensitive_accuracy
tag_blind_accuracy
(test toks - missing - spurious)/test toks (essentially, the fraction of tokens in the reference which were properly assigned a tag - any tag)
tag_blind_error_rate
1 - tag_blind_accuracy

If the user requests confidence data, it will be reported for all four of these additional columns.
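
To make these formulas concrete, here is a minimal sketch (hypothetical counts; the helper is illustrative, not part of MATScore):

    def fixed_span_accuracies(test_toks, refclash, missing, spurious):
        # Per the formulas above, over the fixed-span elements.
        tag_sensitive = (test_toks - refclash - missing - spurious) / test_toks
        tag_blind = (test_toks - missing - spurious) / test_toks
        return {"tag_sensitive_accuracy": tag_sensitive,
                "tag_sensitive_error_rate": 1 - tag_sensitive,
                "tag_blind_accuracy": tag_blind,
                "tag_blind_error_rate": 1 - tag_blind}

    # 1000 test tokens, 12 reference clashes, 7 missing, 9 spurious:
    # tag_sensitive_accuracy = 0.972, tag_blind_accuracy = 0.984
    print(fixed_span_accuracies(1000, 12, 7, 9))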

Pseudo-token scores

The token-level score elements are generated by whatever tokenizer was used to tokenize the scored documents. If no tokenizer was used, you won't be able to produce token-level scores. Character scores, on the other hand, are always available, because the characters themselves serve as the basic elements.

We've included character-level scoring to provide sub-tag-level granularity in situations where tokenization hasn't been performed or isn't available for some reason (although nothing stops you from using these methods alongside token-level scores). In addition, in the interest of producing something more "token-like" in the absence of actual tokenization, we've introduced the notion of a "pseudo-token". To compute the pseudo-tokens for a document, we collect the set of start and end indices for the content annotations in both the reference and hypothesis documents, order the indices, and count the whitespace-delimited tokens in each span, including the edge spans of the document. This count will be, at minimum, the number of whitespace-delimited tokens in the document as a whole, but may be greater if annotation boundaries don't abut whitespace.

For example, consider this deeply artificial example:

ref: the future <NP>President of the United State</NP>s
hyp: the<NP> future President of the Unit</NP>ed States

The pseudo-tokens in this document are computed as follows. The annotation boundaries fall at character indices 3 and 32 (hypothesis) and 11 and 40 (reference); together with the document edges, these indices divide the text into five regions:

"the" (1 token)
" future " (1 token)
"President of the Unit" (4 tokens)
"ed State" (2 tokens)
"s" (1 token)

This yields 9 pseudo-tokens, where the document as a whole contains only 7 whitespace-delimited tokens.
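
A sketch of this computation, using whitespace splitting to count the tokens in each region (the function is illustrative, not MATScore internals):

    def pseudo_token_count(text, ref_spans, hyp_spans):
        # Cut the text at every annotation boundary (plus the document
        # edges), then sum the whitespace-delimited tokens per region.
        boundaries = {0, len(text)}
        for start, end in list(ref_spans) + list(hyp_spans):
            boundaries.update((start, end))
        cuts = sorted(boundaries)
        return sum(len(text[i:j].split()) for i, j in zip(cuts, cuts[1:]))

    text = "the future President of the United States"
    # The ref <NP> covers characters 11-40, the hyp <NP> characters 3-32.
    print(pseudo_token_count(text, [(11, 40)], [(3, 32)]))  # 9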

The granularity of pseudo-tokens is intended to be more informative than character granularity for those languages which are substantially whitespace-delimited, without requiring any complex, and perhaps irrelevant, decisions about tokenization. Using both the whitespace boundaries and the annotation boundaries as region delimiters lets us work at the minimum level of granularity that the pair of documents in question requires to account for all the annotation contrasts. We recognize that this is a novel approach, but we hope it will be useful.

Note: unlike token and character scores, the number of pseudo-tokens is a function of the overlaps between the reference and hypothesis. Therefore, the actual number of pseudo-tokens in the document will vary slightly depending on the performance and properties of your tagger. Do not be alarmed by this.

Details

The detail spreadsheet is intended to provide a span-by-span assessment of the scoring inputs.

file
the name of the hypothesis document from which the entry is drawn
type
one of missing, spurious, match (the meaning of these values should be clear from the preceding discussion), or one of the error causes described above
refid
the ID of the annotation in the reference document, if the ID exists (used for cross-referencing with annotation attribute values)
hypid
the ID of the annotation in the hypothesis document, if the ID exists (used for cross-referencing with annotation attribute values)
refdescription
the description of the annotation in the reference document
hypdescription
the description of the annotation in the hypothesis document
reflabel
the label on the annotation in the reference document
refstart
the start index, in characters, of the annotation in the reference document, if spanned
refend
the end index, in characters, of the annotation in the reference document, if spanned
hyplabel
the label on the annotation in the hypothesis document
hypstart
the start index, in characters, of the annotation in the hypothesis document, if spanned
hypend
the end index, in characters, of the annotation in the hypothesis document, if spanned
refcontent
the text between the start and end indices of the annotation in the reference document, if spanned
hypcontent
the text between the start and end indices of the annotation in the hypothesis document, if spanned

In addition to these columns, if the task contains an explicitly defined similarity profile (i.e., not the default) which specifies dimensions other than the label and span, the mismatch type associated with each dimension will be listed, one per column, immediately after the "hypend" column.