Scoring Engine

Description

The scoring engine compares two tagged files, or two directories of tagged files. Typically, one input is the hypothesis (an automatically tagged file) and the other is the reference (a gold-standard tagged file). But this tool can be used to compare any two inputs.

For a description of the pairing and scoring algorithm, see here. For a description of the scorer output, see here.

Usage

Unix:

% $MAT_PKG_HOME/bin/MATScore

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd

Usage: MATScore [options]

Core options

--task <task>
Optional. If specified, the scorer will use the tags (or tag+attributes) specified in the named task.
--debug
Enable debug output.
--content_annotation_sets set,set,set...
Optional. If --task is specified, the annotations which are processed by default are those whose labels are in annotation sets which are not in the admin, zone or token categories (the "content annotations"). If you want to replace these default annotation sets, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix). The sets specified will operate as a filter on the available annotations.
--exclude_annotation_sets set,set,set...
Optional. As with --content_annotation_sets, this flag applies only if --task is specified; by default, the annotations processed are the "content annotations" described there. If you want to exclude some of those default sets or categories, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix). The sets specified will operate as a filter on the available annotations.
--include_annotation_sets set,set,set...
Optional. As with --content_annotation_sets, this flag applies only if --task is specified. If you want to include sets or categories beyond the defaults, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix). The sets specified will operate as a filter on the available annotations.
--content_annotations ann,ann,ann...
Optional. If no task is specified, the scorer requires additional, external information to determine which annotations count as "content annotations" and should be processed. Use this flag to provide a comma-separated sequence of annotation labels which should be treated as content annotations. Ignored if --task is present.
--content_annotations_all
Optional. If no task is specified, the scorer requires additional, external information to determine which annotations count as "content annotations" and should be processed. Use this flag to assert that all labels not declared with the --token_annotations option should be treated as content annotations. Ignored if --task is present.
--token_annotations ann,ann,ann...
Optional. If no task is specified, the scorer requires additional, external information to determine which annotations are token annotations. Use this flag to provide a comma-separated sequence of annotation labels which should be treated as token annotations. Ignored if --task is present.
--equivalence_class equivlabel oldlabel,oldlabel,...
Optional and repeatable. In some cases, you may wish to collapse two or more labels into a single equivalence class when you run the scorer. The first argument to this parameter is the label for the equivalence class; the second argument is a comma-separated sequence of existing effective annotation labels.
--ignore label,label,...
Optional. In some cases, you may wish to ignore some labels entirely. The value of this parameter is a comma-separated sequence of effective annotation labels. If an annotation in the reference or hypothesis bears one of these labels, it will be as if the annotation is simply not present. (An example combining several of the flags above appears at the end of this option list.)
--similarity_profile profile
If provided, the name of a similarity profile in the provided task. The reserved name "<matdefault>" accesses the system default similarity profile, ignoring any profiles defined in the task. Ignored if --task is not provided.
--score_profile profile
If provided, the name of a score profile in the provided task. The reserved name "<matdefault>" accesses the system default score profile, ignoring any profiles defined in the task. Ignored if --task is not provided.
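
For instance, the following invocation scores two task-less documents, treating PER, ORG and DATE as content annotations, folding PER and ORG into a single NAMED equivalence class, and ignoring DATE entirely. (The PER, ORG, DATE and NAMED labels here are hypothetical, for illustration only.)

% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
  --content_annotations PER,ORG,DATE \
  --equivalence_class NAMED PER,ORG --ignore DATE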

Filtering annotations with annotation sets

The interaction between specified, included or excluded annotation sets and the annotations present is subtle. In some cases, annotation sets define attributes which refer to labels defined in other annotation sets. In these cases, the scoring algorithm ensures that only the labels and attributes made available by the included sets are scored. As a result, some annotation attributes may be ignored in the pairing algorithm, and some annotations may not be compared on the basis of their labels and spans. These restrictions interact appropriately with the similarity and score profiles.

The scorer will not report on annotation labels designated processable="no".
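
For example, if your task defines an annotation set whose annotations you don't want scored (the set name "relations" here is hypothetical), you might say:

% $MAT_PKG_HOME/bin/MATScore --task "Named Entity" --file /path/to/hyp --ref_file /path/to/ref \
  --exclude_annotation_sets relations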

Hypothesis options

--file <file>
The hypothesis file to evaluate. Must be paired with --ref_file. Either this or --dir must be specified.
--dir <dir>
A directory of files to evaluate. Must be paired with --ref_dir. Either this or --file must be specified.
--file_re <re>
A Python regular expression to filter the basenames of hypothesis files when --dir is used. Optional. The expression must match the entire basename. (See the example at the end of this list.)
--file_type <t>
The file type of the hypothesis document(s). One of the readers. Default is mat-json.
--encoding <e>
Hypothesis file character encoding. Default is the default encoding of the file type. Ignored for file types such as mat-json which have fixed encodings.
--gold_only
Under normal circumstances, if segments are present, all segments are compared. Use this flag to restrict the comparison to those regions which overlap with 'human gold' or 'reconciled' segments in the hypothesis.
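
For example, to restrict scoring to hypothesis files whose names end in .json (recall that the expression must match the entire basename):

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
  --file_re '.*\.json'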

Reference options

--ref_file <file>
The reference file to compare the hypothesis to. Must be paired with --file. Either this or --ref_dir must be specified.
--ref_dir <dir>
A directory of files to compare the hypothesis to. Must be paired with --dir. Either this or --ref_file must be specified.
--ref_fsuff_off <suff>
When --ref_dir is used, each qualifying file in the hypothesis dir is paired, by default, with a file in the reference dir with the same basename. This parameter specifies a suffix to remove from the hypothesis file before searching for a pair in the reference directory. If both this and --ref_fsuff_on are present, the removal happens before the addition.
--ref_fsuff_on <suff>
When --ref_dir is used, each qualifying file in the hypothesis dir is paired, by default, with a file in the reference dir with the same basename. This parameter specifies a suffix to add to the hypothesis file before searching for a pair in the reference directory. If both this and --ref_fsuff_off are present, the removal happens before the addition.
--ref_file_type <t>
The file type of the reference document(s). One of the readers. Default is mat-json.
--ref_encoding <e>
Reference file character encoding. Default is the default encoding of the file type. Ignored for file types such as mat-json which have fixed encodings.
--ref_gold_only
Under normal circumstances, if segments are present, all segments are compared. Use this flag to restrict the comparison to those regions which overlap with 'human gold' or 'reconciled' segments in the reference.

Score output options

Note that all the CSV files created by the scorer are in UTF-8 encoding.

--tag_output_mismatch_details
By default, the tag scores, like the other scores, present a single value for all the mismatches. If this option is specified, the tag scores will provide a detailed breakdown of the various mismatches: overmarks, undermarks, overlaps, label clashes, etc.
--details
If present, generate a separate spreadsheet providing detailed alignments of matches and errors. See this special note on viewing CSV files containing natural language text.
--confusability
If present, generate a separate spreadsheet providing a token- or pseudo-token-level confusability matrix for all paired tokens. If any token is paired more than once, the confusability matrix will not be generated (because the result would make no sense). The null label comparisons are included in the matrix.
--by_token
By default, the scorer generates aggregate tag-level scores. If this flag is present, generate a separate spreadsheet showing aggregate token-level scores. Note: in order for token-level scoring to work, the hypothesis document must contain token annotations, and the content annotation boundaries in both documents must align with token annotation boundaries. If there are no token annotations, no token-level scores will be generated; if one or both documents contain token annotations but they're not aligned with content annotations, the behavior is undefined.
--by_pseudo_token
By default, the scorer generates aggregate tag-level scores. If this flag is present, generate a separate spreadsheet showing aggregate scores using what we call 'pseudo-tokens', which are essentially the spans created by the union of whitespace boundaries and span boundaries (see the illustrative sketch after this list). For English and other Roman-alphabet languages, this score should be very, very close to the token-level score, without requiring the overhead of having actual token annotations in the document. For details about pseudo-tokens, see here.
--by_character
By default, the scorer generates aggregate tag-level scores. If this flag is present, generate a separate spreadsheet showing aggregate character-level scores. For languages like Chinese, this score may provide some useful sub-phrase metrics without requiring the overhead of having token annotations in the document.
--compute_confidence_data
If present, the scorer will compute means and variances for the various metrics provided in the tag and token spreadsheets, if --csv_output_dir is specified.
--csv_output_dir <dir>
By default, the scorer formats text tables to standard output. If this flag is present, the scores (if requested) will be written as CSV files to <dir>/bytag_<format>.csv, <dir>/bytoken_<format>.csv, <dir>/bypseudotoken_<format>.csv, <dir>/bychar_<format>.csv, <dir>/details.csv, and <dir>/confusability.csv. The value or values for <format> are governed by the --csv_formula_output option.
--csv_formula_output <s>
A comma-separated list of options for CSV output. The possibilities are 'oo' (formulas with OpenOffice separators), 'excel' (formulas with Excel separators), and 'literal' (no formulas). The scorer will produce CSV output files for each of the conditions you specify. By default, if --csv_output_dir is specified, this value is 'excel'. Note that the OpenOffice and Excel formula formats are incompatible with each other: files generated with Excel separators will open correctly only in Excel, and likewise for OpenOffice.
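
The pseudo-token computation referenced above can be illustrated with a short Python sketch. This is an illustration of the idea only, not the scorer's actual implementation: the candidate boundaries are the union of whitespace transitions and annotation span endpoints, and the pseudo-tokens are the non-whitespace stretches between adjacent boundaries.

# Illustrative sketch only; not the scorer's actual code.
def pseudo_token_spans(text, span_boundaries):
    # Candidate cut points: annotation span endpoints, the ends of the
    # text, and every transition between whitespace and non-whitespace.
    boundaries = set(span_boundaries) | {0, len(text)}
    for i in range(1, len(text)):
        if text[i].isspace() != text[i - 1].isspace():
            boundaries.add(i)
    cuts = sorted(boundaries)
    # Keep only the non-whitespace stretches between adjacent cuts.
    return [(s, e) for s, e in zip(cuts, cuts[1:]) if text[s:e].strip()]

# "John Smith" with an annotation spanning offsets 0-8 ("John Smi")
# yields the pseudo-tokens (0, 4), (5, 8) and (8, 10).
print(pseudo_token_spans("John Smith", [0, 8]))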

When the user requests confidence data via the --compute_confidence_data option, the scorer produces 1000 alternative score sets. Each score set is created by making M random selections of file scores from the core set of M file scores. (This procedure will naturally have multiple copies of some documents and no copies of others in each score set, which is the source of the variation for this computation.) The scorer then computes the overall metrics for each alternative score set, and computes the mean and variance over the 1000 instances of each of the precision, recall, and fmeasure metrics. This "sampling with replacement" yields a more stable mean and variance.
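
The resampling idea is the standard bootstrap. A minimal sketch follows (this is not the scorer's actual code, and the per-file (match, hypothesis, reference) counts are assumed inputs; the scorer computes means and variances for precision and recall as well):

import random
import statistics

def f_measure(counts):
    # counts: list of (match, hypothesis, reference) totals per file.
    match = sum(m for m, h, r in counts)
    hyp = sum(h for m, h, r in counts)
    ref = sum(r for m, h, r in counts)
    p = match / hyp if hyp else 0.0
    r = match / ref if ref else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def bootstrap_f(file_counts, trials=1000):
    # Resample the M file scores with replacement, M at a time, and
    # collect the aggregate F-measure of each resampled score set.
    m = len(file_counts)
    samples = [f_measure(random.choices(file_counts, k=m))
               for _ in range(trials)]
    return statistics.mean(samples), statistics.variance(samples)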

Other options

The readers referenced in the --file_type and --ref_file_type options may introduce additional options, which are described here. These additional options must follow the --file_type and --ref_file_type options. The options for the reference file type are all prefixed with ref_; so, for instance, to specify the --xml_input_is_overlay option for xml-inline reference documents, use the option --ref_xml_input_is_overlay.
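
For example, to pass that reader option to an xml-inline reference document:

% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref.xml \
  --ref_file_type xml-inline --ref_xml_input_is_overlay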

Examples

Example 1

Let's say you have two files, /path/to/ref and /path/to/hyp, which you want to compare. The default settings will print a table to standard output.

Unix:

% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref

Example 2

Let's say that instead of printing a table to standard output, you want to produce CSV output with embedded Excel formulas (the default formula type), and you want all three spreadsheets: tag scores, token scores, and details.

Unix:

% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
--csv_output_dir $PWD --details --by_token

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref ^
--csv_output_dir %CD% --details --by_token

This invocation will not produce any table on standard output, but will leave three files in the current directory: bytag_excel.csv, bytoken_excel.csv, and details.csv.

Example 3

Let's say you have two directories full of files. /path/to/hyp contains files of the form file<n>.txt.json, and /path/to/ref contains files of the form file<n>.json. You want to compare the corresponding files to each other, and you want tag and token scoring, but not details, and you intend to view the spreadsheet in OpenOffice.

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --csv_formula_output oo --by_token

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% ^
--csv_formula_output oo --by_token

For each file in /path/to/hyp, this invocation will prepare a candidate filename to look for in /path/to/ref by removing the .txt.json suffix and adding the .json suffix. The current directory will contain bytag_oo.csv and bytoken_oo.csv.

Example 4

Let's say that you're in the same situation as example 3, but you want confidence information included in the output spreadsheets:

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD \
--csv_formula_output oo --by_token --compute_confidence_data

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% ^
--csv_formula_output oo --by_token --compute_confidence_data

Example 5

Let's say that you're in the same situation as example 3, except that your documents contain lots of tags and you're only interested in scoring the tags listed in the "Named Entity" task. Furthermore, you're going to import the data into a tool other than Excel, so you want the values calculated for you rather than having embedded equations:

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD \
--csv_formula_output literal --by_token --task "Named Entity"

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% ^
--csv_formula_output literal --by_token --task "Named Entity"

Example 6

Let's say you're in the same situation as example 3, but your reference documents are XML inline documents, and are of the form file<n>.xml. Do this:

Unix:

% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.xml' \
--csv_output_dir $PWD \
--csv_formula_output oo --by_token --ref_file_type xml-inline

Windows native:

> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".xml" ^
--csv_output_dir %CD% ^
--csv_formula_output oo --by_token --ref_file_type xml-inline

Note that --ref_fsuff_on has changed, in addition to adding the --ref_file_type option.