The scoring engine compares two tagged files, or two directories
of tagged files. Typically, one input is the hypothesis (an
automatically tagged file) and the other is the reference (a
gold-standard tagged file). But this tool can be used to compare
any two inputs.
For a description of the pairing and scoring algorithm, see here. For a description of the
scorer output, see here.
Unix:
% $MAT_PKG_HOME/bin/MATScore
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd
Usage: MATScore [options]
--task <task> |
Optional. If specified, the
scorer will use the tags (or tag+attributes) specified in
the named task. |
--debug |
Enable debug output. |
--content_annotation_sets set,set,set... |
Optional. If --task is specified, the
annotations which are processed by default are those whose
labels are in annotation sets which are not in the admin,
zone or token categories (the "content annotations"). If you
want to replace these default annotation sets, use this flag
to provide a comma-separated sequence of set names (or
category names using the 'category:' prefix). The sets
specified will operate as a filter on the
available annotations. |
--exclude_annotation_sets set,set,set... |
Optional. If --task is specified, the annotations which are processed by default are those whose labels are in annotation sets which are not in the admin, zone or token categories (the "content annotations"). If you want to exclude some of these sets or categories, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix). The sets specified will operate as a filter on the available annotations. |
--include_annotation_sets set,set,set... |
Optional. If --task is specified, the annotations which are processed by default are those whose labels are in annotation sets which are not in the admin, zone or token categories (the "content annotations"). If you want to include other sets or categories, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix). The sets specified will operate as a filter on the available annotations. |
--content_annotations
ann,ann,ann... |
Optional. If no task is
specified, the scorer requires additional, external
information to determine which annotations count as "content
annotations" and should be processed. Use this flag to
provide a comma-separated sequence of annotation labels
which should be treated as content annotations. Ignored if
--task is present. |
--content_annotations_all |
Optional. If no task is specified, the scorer
requires additional, external information to determine which
annotations count as \"content annotations\" and should be
processed. Use this flag to assert that all labels not
declared with the --token_annotations option should be
treated as content annotations. Ignored if --task is
present. |
--token_annotations
ann,ann,ann... |
Optional. If no task is
specified, the scorer requires additional, external
information to determine which annotations are token
annotations. Use this flag to provide a comma-separated
sequence of annotation labels which should be treated as
token annotations. Ignored if --task is present. |
--equivalence_class
equivlabel oldlabel,oldlabel,... |
Optional and repeatable. In
some cases, you may wish to collapse two or more labels into
a single equivalence class when you run the scorer. The
first argument to this parameter is the label for the
equivalence class; the second argument is a comma-separated
sequence of existing effective annotation labels. |
--ignore label,label,... |
Optional. In some cases, you
may wish to ignore some labels entirely. The value of this
parameter is a comma-separated sequence of effective
annotation labels. If an annotation in the reference or
hypothesis bears this label, it will be as if the annotation
is simply not present. |
--similarity_profile profile |
If provided, the name of a similarity profile
in the provided task. The reserved name "<matdefault>"
accesses the system default similarity profile, ignoring any
profiles defined in the task. Ignored if --task is not
provided. |
--score_profile profile |
If provided, the name of a score profile in
the provided task. The reserved name "<matdefault>"
accesses the system default score profile, ignoring any
profiles defined in the task. Ignored if --task is not
provided. |
The interaction between specified, included or excluded
annotation sets and the annotations present is subtle. In some
cases, annotation sets define attributes which refer to labels
defined in other annotation sets. In these cases, the scoring
algorithm ensures that only the labels and attributes made
available by the included sets are scored. The result may be
that some annotation attributes are ignored in the pairing
algorithm, and some annotations are not compared on the basis
of their labels and spans. These restrictions interact
appropriately with the similarity and score profiles.
The scorer will not report on annotation labels designated processable="no".
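As a hedged illustration of the filtering options above (the set name 'relations', the labels PERSON and ORGANIZATION collapsed into the class ENTITY, and the ignored label DATE are purely illustrative; substitute the sets and labels your own task actually defines):
Unix:
% $MAT_PKG_HOME/bin/MATScore --task "Named Entity" \
    --exclude_annotation_sets relations \
    --equivalence_class ENTITY PERSON,ORGANIZATION \
    --ignore DATE \
    --file /path/to/hyp --ref_file /path/to/ref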
--file <file> |
The hypothesis file to
evaluate. Must be paired with --ref_file. Either this or
--dir must be specified. |
--dir <dir> |
A directory of files to
evaluate. Must be paired with --ref_dir. Either this or
--file must be specified. |
--file_re <re> |
A Python regular expression
to filter the basenames of hypothesis files when --dir is
used. Optional. The expression should match the entire
basename. |
--file_type <t> |
The file type of the
hypothesis document(s). One of the readers. Default is
mat-json. |
--encoding <e> |
Hypothesis file character
encoding. Default is the default encoding of the file type.
Ignored for file types such as mat-json which have fixed
encodings. |
--handle_non_bmp <v> |
Instructions on how to handle Unicode
characters outside the Basic Multilingual Plane in the
hypothesis. Overrides the default HANDLE_NON_BMP configuration variable. Value
is one of 'warn', 'scrub_or_warn', 'fail', 'ignore'. See the Unicode issues discussion
for details. Default is 'warn'. |
--gold_only |
Under normal circumstances,
if segments are present, all segments are compared. Use this
flag to restrict the comparison to those regions which
overlap with 'human gold' or 'reconciled' segments in the
hypothesis. |
--ref_file <file> |
The reference file to compare
the hypothesis to. Must be paired with --file. Either this
or --ref_dir must be specified. |
--ref_dir <dir> |
A directory of files to
compare the hypothesis to. Must be paired with --dir. Either
this or --ref_file must be specified. |
--ref_fsuff_off <suff> |
When --ref_dir is used, each
qualifying file in the hypothesis dir is paired, by default,
with a file in the reference dir with the same basename.
This parameter specifies a suffix to remove from the
hypothesis file before searching for a pair in the reference
directory. If both this and --ref_fsuff_on are present, the
removal happens before the addition. |
--ref_fsuff_on <suff> |
When --ref_dir is used, each qualifying file in the hypothesis dir is paired, by default, with a file in the reference dir with the same basename. This parameter specifies a suffix to add to the hypothesis file before searching for a pair in the reference directory. If both this and --ref_fsuff_off are present, the removal happens before the addition. |
--ref_file_type <t> |
The file type of the reference document(s). One of the readers. Default is mat-json. |
--ref_encoding <e> |
Reference file character
encoding. Default is the default encoding of the file type.
Ignored for file types such as mat-json which have fixed
encodings. |
--ref_handle_non_bmp <v> |
Instructions on how to handle Unicode
characters outside the Basic Multilingual Plane in the
reference. Overrides
the default HANDLE_NON_BMP configuration
variable. Value is one of 'warn', 'scrub_or_warn', 'fail',
'ignore'. See the Unicode
issues discussion for details. Default is 'warn'. |
--ref_gold_only |
Under normal circumstances,
if segments are present, all segments are compared. Use this
flag to restrict the comparison to those regions which
overlap with 'human gold' or 'reconciled' segments in the
reference. |
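For instance, here is a minimal sketch combining the hypothesis and reference input options (the paths and the encoding value are placeholders), comparing a mat-json hypothesis against an inline XML reference in a specific encoding:
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref.xml \
    --ref_file_type xml-inline --ref_encoding utf-8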
Note that all the CSV files created by the scorer are in UTF-8
encoding.
--tag_output_mismatch_details |
By default, the tag scores,
like the other scores, present a single value for all the
mismatches. If this option is specified, the tag scores will
provide a detailed breakdown of the various mismatches:
overmarks, undermarks, overlaps, label clashes, etc. |
--details |
If present, generate a
separate spreadsheet providing detailed alignments of
matches and errors. See this special note on viewing
CSV files containing natural language text. |
--detail_context_window <width> |
If present and --details is present, provide
a left and right context window of the specified character
width for spanned annotations in the details spreadsheet. |
--normalize_detail_whitespace |
If present and --details is present,
normalize the detail whitespace by replacing all whitespace
sequences with a single space. This option might be useful
if you want to view the detail spreadsheet in Excel and you
don't want the newlines to mess up the import wizard. |
--confusability |
If present, generate a
separate spreadsheet providing a token- or
pseudo-token-level confusability matrix for all paired
tokens. If any token is paired more than once, the
confusability matrix will not be generated (because the
result would make no sense). The null label comparisons are
included in the matrix. |
--by_token |
By default, the scorer
generates aggregate tag-level scores. If this flag is
present, generate a separate spreadsheet showing aggregate
token-level scores. Note:
in order for token-level scoring to work, the hypothesis
document must contain token annotations, and the content
annotation boundaries in both documents must align with
token annotation boundaries. If there are no token
annotations, no token-level scores will be generated; if one
or both documents contain token annotations but they're not
aligned with content annotations, the behavior is undefined. |
--by_pseudo_token |
By default, the scorer
generates aggregate tag-level scores. If this flag is
present, generate a separate spreadsheet showing aggregate
scores using what we call 'pseudo-tokens', which are
essentially the spans created by the union of whitespace
boundaries and span boundaries. For English and other
Roman-alphabet languages, this score should be very, very
close to the token-level score, without requiring the
overhead of having actual token annotations in the document.
For details about pseudo-tokens, see here. |
--by_character |
By default, the scorer
generates aggregate tag-level scores. If this flag is
present, generate a separate spreadsheet showing aggregate
character-level scores. For languages like Chinese, this score may
provide some useful sub-phrase metrics without requiring the
overhead of having token annotations in the document. |
--compute_confidence_data |
If present, the scorer will
compute means and variances for the various metrics provided
in the tag and token spreadsheets, if --csv_output_dir is
specified. |
--csv_output_dir <dir> |
By default, the scorer
formats text tables to standard output. If this flag is
present, the scores (if requested) will be written as CSV
files to <dir>/bytag_<format>.csv,
<dir>/bytoken_<format>.csv,
<dir>/bypseudotoken_<format>.csv,
<dir>/bychar_<format>.csv,
<dir>/details.csv, and <dir>/confusability.csv.
The value or values for <format> are governed by the
--csv_formula_output option. |
--csv_formula_output
<s> |
A comma-separated list of
options for CSV output. The possibilities are 'oo' (formulas
with OpenOffice separators), 'excel' (formulas with Excel
separators), 'literal' (no formulas). The scorer will
produce CSV output files for each of the conditions you
specify. By default, if --csv_output_dir is specified, this
value is 'excel'. Note that the OpenOffice and Excel formula
formats are incompatible with each other, so you'll only be
able to open output files with Excel separators in Excel,
etc. |
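To make the output options concrete, here is a hedged sketch requesting pseudo-token scores, a confusability matrix, and a details spreadsheet with a 40-character context window, written as both OpenOffice-formula and literal CSV variants (so, for example, both bytag_oo.csv and bytag_literal.csv would appear in the output directory):
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
    --by_pseudo_token --confusability \
    --details --detail_context_window 40 --normalize_detail_whitespace \
    --csv_output_dir $PWD --csv_formula_output oo,literal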
When the user requests confidence data via the --compute_confidence_data option, the scorer produces 1000 alternative score sets. Each score set is created by making M random selections of file scores from the core set of M file scores. (This procedure will naturally have multiple copies of some documents and no copies of others in each score set, which is the source of the variation for this computation.) The scorer then computes the overall metrics for each alternative score set, and computes the mean and variance over the 1000 instances of each of the precision, recall, and fmeasure metrics. This "sampling with replacement" yields a more stable mean and variance.
The readers referenced in the --file_type and --ref_file_type options may introduce additional options, which are described here. These additional options must follow the --file_type and --ref_file_type options. The options for the reference file types are all prepended with a ref_ prefix; so for instance, to specify the --xml_input_is_overlay option for xml-inline reference documents, use the option --ref_xml_input_is_overlay.
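For example, to pass the xml-inline reader's --xml_input_is_overlay option to the reference side (a sketch; consult the reader documentation for the options each reader actually accepts, and note that the reader option follows --ref_file_type):
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref.xml \
    --ref_file_type xml-inline --ref_xml_input_is_overlay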
Let's say you have two files, /path/to/ref and /path/to/hyp,
which you want to compare. The default settings will print a table
to standard output.
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref
Let's say that instead of printing a table to standard output,
you want to produce CSV output with embedded Excel formulas (the
default formula type), and you want all three spreadsheets.
Unix:
% $MAT_PKG_HOME/bin/MATScore --file /path/to/hyp --ref_file /path/to/ref \
--csv_output_dir $PWD --details --by_token
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --file c:\path\to\hyp --ref_file c:\path\to\ref ^
--csv_output_dir %CD% --details --by_token
This invocation will not produce any table on standard output,
but will leave three files in the current directory:
bytag_excel.csv, bytoken_excel.csv, and details.csv.
Let's say you have two directories full of files. /path/to/hyp
contains files of the form file<n>.txt.json, and
/path/to/ref contains files of the form file<n>.json. You
want to compare the corresponding files to each other, and you
want tag and token scoring, but not details, and you intend to
view the spreadsheet in OpenOffice.
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --csv_formula_output oo --by_token
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% --csv_formula_output oo --by_token
For each file in /path/to/hyp, this invocation will prepare a
candidate filename to look for in /path/to/ref by removing the
.txt.json suffix and adding the .json suffix. The current
directory will contain bytag_oo.csv and bytoken_oo.csv.
Let's say that you're in the same situation as example 3, but
you want confidence information included in the output
spreadsheets:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --csv_formula_output oo --by_token --compute_confidence_data
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% --csv_formula_output oo --by_token --compute_confidence_data
Let's say that you're in the same situation as example 3, but
your documents contain lots of tags, and you're only interested in
scoring the tags listed in the "Named Entity" task. Furthermore,
you're going to import the data into a tool other than Excel, so
you want the values calculated for you rather than having embedded
equations:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
--csv_output_dir $PWD --csv_formula_output literal --by_token --task "Named Entity"
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".json" ^
--csv_output_dir %CD% --csv_formula_output literal --by_token --task "Named Entity"
Let's say you're in the same situation as example 3, but your
reference documents are XML inline documents, and are of the form
file<n>.xml. Do this:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
--ref_fsuff_off '.txt.json' --ref_fsuff_on '.xml' \
--csv_output_dir $PWD --csv_formula_output oo --by_token --ref_file_type xml-inline
Windows native:
> %MAT_PKG_HOME%\bin\MATScore.cmd --dir c:\path\to\hyp --ref_dir c:\path\to\ref ^
--ref_fsuff_off ".txt.json" --ref_fsuff_on ".xml" ^
--csv_output_dir %CD% --csv_formula_output oo --by_token --ref_file_type xml-inline
Note that --ref_fsuff_on has changed, in addition to adding the
--ref_file_type option.
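Finally, if your task defines similarity or score profiles, you can select one by name when scoring against the task. Here is a hedged sketch using the reserved name "<matdefault>" documented above; a profile name actually defined in your task would go in its place:
Unix:
% $MAT_PKG_HOME/bin/MATScore --dir /path/to/hyp --ref_dir /path/to/ref \
    --ref_fsuff_off '.txt.json' --ref_fsuff_on '.json' \
    --task "Named Entity" --score_profile '<matdefault>'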