The annotation reporter produces concordance-style reports on the
content annotations in a given set of documents, in either CSV or
text form. The CSV file contains the following columns:
file |
the name of the document from
which the entry is drawn |
start |
the start index, in
characters, of the span in the document |
end |
the end index, in characters,
of the span in the document |
left context |
the context to the left of
the start index |
text |
the text in between the start
and end indices |
label |
the label on the span in the
document. If the annotation contains attributes and values,
these will be represented in the label. |
right context |
the context to the right of
the end index |
It's also possible to omit the left and right contexts, if you
prefer. The text file contains the same columns, except that file,
start, and end are collapsed into a single location column.
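To make the relationship between these columns concrete, here is a minimal Python sketch of how one concordance row could be derived from a document string and a span. It is an illustration under assumptions, not the reporter's actual implementation: the function name concordance_row and the sample sentence are hypothetical, the end index is assumed to be exclusive, and the window argument mirrors the --concordance_window option (default 32 characters on each side).

```python
def concordance_row(signal, start, end, window=32):
    """Return (left context, text, right context) for the span [start, end).

    Assumes character offsets and an exclusive end index.
    """
    left = signal[max(0, start - window):start]   # up to `window` chars before the span
    text = signal[start:end]                      # the annotated text itself
    right = signal[end:end + window]              # up to `window` chars after the span
    return left, text, right


# Hypothetical document and span, for illustration only.
doc = "The committee met in Boston on Tuesday to review the budget."
start = doc.index("Boston")
end = start + len("Boston")

left, text, right = concordance_row(doc, start, end, window=10)
print(repr(left), repr(text), repr(right))
```

Omitting the contexts, as described above, would correspond to keeping only the text element of the returned tuple.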
This tool also allows you, via the --partition_by_label option,
to generate CSV and text files for each content annotation label
in the document set. In these versions, the annotation ID is
reported in a column after the "end" column, and instead of the
"label" column, the file contains a column for each known
attribute of the annotation type.
It's also possible to interpolate document-level statistics such
as file length and number of annotations per label into these
reports.
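As a rough picture of what such document-level statistics amount to, the following Python sketch computes the length of a document in characters and the number of annotations per label. It is a sketch under assumptions: the function name file_stats, the tuple representation of annotations, and the sample data are hypothetical and are not taken from the tool.

```python
from collections import Counter


def file_stats(signal, annotations):
    """Summarize one document: length in characters and annotation count per label.

    `annotations` is a hypothetical list of (label, start, end) tuples.
    """
    counts = Counter(label for label, _start, _end in annotations)
    return {"length": len(signal), "counts": dict(counts)}


# Hypothetical document and annotations, for illustration only.
doc = "The committee met in Boston on Tuesday."
annotations = [("LOCATION", 21, 27), ("DATE", 31, 38)]

stats = file_stats(doc, annotations)
print(stats["length"], stats["counts"])
```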
Because the CSV files contain language data, please consult this
special note on how to view
them.
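If you would rather process the CSV report in a script than view it in a spreadsheet, a minimal Python sketch follows. It relies on two assumptions that you should check against your own output: that the file begins with a header row, and that the header names match the column names listed above. The sample rows are hypothetical and stand in for a real report.csv.

```python
import csv
import io

# Hypothetical sample standing in for report.csv. The header row and its
# exact column names are assumptions; check them against your own output.
sample = (
    "file,start,end,left context,text,label,right context\n"
    'doc1.json,21,27,"ee met in ",Boston,LOCATION," on Tuesda"\n'
    'doc2.json,0,3,,Zoë,PERSON," said, ""hi"""\n'
)


def read_report(stream):
    """Parse a concordance CSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(stream))


rows = read_report(io.StringIO(sample))
for row in rows:
    print(row["label"], row["text"], row["start"], row["end"])
```

To read a real report, open it with open(path, newline="", encoding="utf-8") and pass the file object to read_report. The csv module handles quoted fields, so commas and quotation marks inside the context columns are preserved.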
Unix:
% $MAT_PKG_HOME/bin/MATReport
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd
Usage: MATReport [options]
--task <task> |
Name of the task to use.
Obligatory if neither --content_annotations nor
--content_annotations_all is provided, and more than one
task is registered. |
--content_annotations ann,ann,ann... |
Optional. If --task is not
provided, the reporter requires additional, external
information to determine which annotations are content
annotations. Use this flag to provide a comma-separated
sequence of annotation labels which should be treated as
content annotations. |
--content_annotations_all |
Optional. If neither --task
nor --content_annotations is provided, this flag will cause
all labels in the document to be treated as content
annotations. |
--content_annotation_sets set,set,... |
Optional. If the annotations to report on are
inferred from a task (either because --task is specified or
because neither --content_annotations nor
--content_annotations_all is specified and there's a single
task), the annotations which are available by default are
those whose labels are in a non-administrative annotation
set (i.e., a set which is not in the admin, zone, or token
categories). If you want to replace these default annotation
sets, use this flag to provide a comma-separated sequence of
set names (or category names using the 'category:' prefix). |
--exclude_annotation_sets set,set,... |
Optional. If the annotations to report on are
inferred from a task (either because --task is specified or
because neither --content_annotations nor
--content_annotations_all is specified and there's a single
task), the annotations which are available by default are
those whose labels are in a non-administrative annotation
set (i.e., a set which is not in the admin, zone, or token
categories). If you want to exclude some of these sets or
categories, use this flag to provide a comma-separated
sequence of set names (or category names using the
'category:' prefix). |
--include_annotation_sets set,set,... |
Optional. If the annotations to report on are
inferred from a task (either because --task is specified or
because neither --content_annotations nor
--content_annotations_all is specified and there's a single
task), the annotations which are available by default are
those whose labels are in a non-administrative annotation
set (i.e., a set which is not in the admin, zone, or token
categories). If you want to include other sets or
categories, use this flag to provide a comma-separated
sequence of set names (or category names using the
'category:' prefix). |
--verbose |
If present, the tool will provide detailed
information on its progress. |
--input_files <file> |
A glob-style pattern describing full pathnames to be reported
on. May be specified with --input_dir. Can be repeated. |
--input_dir <dir> |
A directory, all of whose
files will be reported on. Can be repeated. May be specified
with --input_files. |
--file_type <t> |
The file type of the
document(s). One of the readers.
Default is mat-json. |
--encoding <e> |
The encoding of the input.
The default is the appropriate default for the file type. |
--handle_non_bmp <v> |
Instructions on how to handle Unicode
characters outside the Basic Multilingual Plane. Overrides
the default HANDLE_NON_BMP configuration
variable. Value is one of 'warn', 'scrub_or_warn', 'fail',
'ignore'. See the Unicode
issues discussion for details. Default is 'warn'. |
--output_dir <dir> |
The output directory for the
reports. Will be created if it doesn't exist. Required. |
--csv |
Generate a CSV file in the output directory, with
concordance-style data: file, location, content, left and
right context, annotation label. At least one of this option
or --txt must be provided. The CSV file will be in UTF-8
encoding. See this special note on viewing CSV files
containing natural language text. |
--txt |
Generate a text file in the
output directory, with concordance-style data, sorted first
by annotation label and then by content. At least one of
this option or --csv must be provided. The output file will
be in UTF-8 encoding. |
--concordance_window <i> |
Use the specified value as
the window size on each side of the concordance. Default is
32. |
--omit_concordance_context |
Omit the left and right
concordance context from the output. |
--file_csv |
Generate a separate CSV file
consisting of file-level statistics such as file size in
characters and number of annotations of each type. |
--interpolate_file_info |
Instead of a separate CSV
file for the file-level statistics, interpolate them into
the concordance. |
--include_spanless |
By default, only spanned content annotations
are reported. If this flag is present, spanless annotations
(without position or left or right context, of course) will
be included. If the spanless annotations refer to spanned
annotations, the text context of the referred annotations
will be inserted in the 'text' column. |
--partition_by_label |
If present, in addition to the standard
output file report.csv and/or report.txt, the tool will
generate a separate spreadsheet for each label, with a
column for each attribute. |
--show_spanned_text_for_annotation_attribute_values |
If present, show the text for annotation
attribute values at the first level, if possible. |
--normalize_whitespace |
If present, normalize the whitespace in the
text spans by replacing all whitespace sequences with a
single space. This option might be useful if you want to
view the report in Excel and you don't want the newlines to
mess up the import wizard. |
--create_task <name> |
If present, in addition to the standard
output file, the tool will generate a task subdirectory
which contains a task.xml file appropriate for viewing your
annotations, which you can install using MATManagePluginDirs.
This is useful when you have annotated files which you did
not annotate with MAT. |
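The three annotation-set options in the table above (--content_annotation_sets, --exclude_annotation_sets, --include_annotation_sets) all adjust the same default: the sets that are not in the admin, zone, or token categories. The Python sketch below illustrates that selection logic. It is not the tool's implementation: the function names, the set names, and the "content" category are hypothetical, and the order in which the three options are combined (replace, then exclude, then include) is an assumption.

```python
# Categories the documentation describes as administrative.
ADMIN_CATEGORIES = {"admin", "zone", "token"}


def expand(specs, sets_by_name):
    """Expand set names and 'category:<name>' entries into a set of set names."""
    chosen = set()
    for spec in specs:
        if spec.startswith("category:"):
            category = spec[len("category:"):]
            chosen.update(n for n, c in sets_by_name.items() if c == category)
        else:
            chosen.add(spec)
    return chosen


def select_sets(sets_by_name, replace=None, exclude=(), include=()):
    """Pick the annotation sets to report on (combination order is assumed)."""
    if replace is not None:
        # Analogue of --content_annotation_sets: replace the defaults outright.
        selected = expand(replace, sets_by_name)
    else:
        # Default: every set outside the administrative categories.
        selected = {n for n, c in sets_by_name.items() if c not in ADMIN_CATEGORIES}
    selected -= expand(exclude, sets_by_name)   # analogue of --exclude_annotation_sets
    selected |= expand(include, sets_by_name)   # analogue of --include_annotation_sets
    return selected


# Hypothetical task: set name -> category.
task_sets = {
    "entities": "content",
    "relations": "content",
    "zones": "zone",
    "tokens": "token",
    "review": "admin",
}

print(sorted(select_sets(task_sets)))
print(sorted(select_sets(task_sets, include=["category:token"])))
```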
The readers referenced in the --file_type option may introduce
additional options, which are described here. These additional
options must follow the --file_type option.
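One option from the table above that is easy to picture in code is --normalize_whitespace, which is described as replacing every whitespace sequence with a single space. A minimal Python sketch of that transformation follows. The function name is hypothetical, and the tool may treat edge cases (such as leading or trailing whitespace) differently.

```python
import re


def normalize_whitespace(text):
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


# Hypothetical span text containing a newline and a tab.
span = "New\n  York\tCity"
print(normalize_whitespace(span))
```

Because newlines inside a span become ordinary spaces, each report row stays on a single line, which is the property that makes spreadsheet import more predictable.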
Let's say you have a file, /path/to/file, whose annotations you
want to view in a spreadsheet. You want the results to be written
to /path/to/output.
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --csv --output_dir /path/to/output
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --csv --output_dir c:\path\to\output
Let's say that you only want textual output, and you don't want
the concordance columns:
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --txt \
--output_dir /path/to/output --omit_concordance_context
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --txt ^
--output_dir c:\path\to\output --omit_concordance_context
Let's say you have a directory full of files. /path/to/files
contains files of the form file<n>.json. You want to view
them both in CSV and in text, and you want a smaller concordance
window of 10 characters.
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files '/path/to/files/*.json' \
--csv --txt --output_dir /path/to/output --concordance_window 10
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files "c:\path\to\files\*.json" ^
--csv --txt --output_dir c:\path\to\output --concordance_window 10
Let's say you have a directory full of XML inline files which you
did not annotate with MAT. You want to view the report in CSV, but
you also want to prepare a task.xml file which you can install and
use to view these documents.
Unix:
% $MAT_PKG_HOME/bin/MATReport --input_files '/path/to/files/*.xml' \
--csv --output_dir /path/to/output --omit_concordance_context --include_spanless \
--create_task "New task"
Windows native:
> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files "c:\path\to\files\*.xml" ^
--csv --output_dir c:\path\to\output --omit_concordance_context --include_spanless ^
--create_task "New task"
Once this operation completes, the directory "task" in the output
directory can be installed using MATManagePluginDirs.