Annotation reporter

Description

The annotation reporter produces concordance-style reports on the content annotations in a given set of documents, in either CSV or text form. The CSV file contains the following columns:

file
the name of the document from which the entry is drawn
start
the start index, in characters, of the span in the document
end
the end index, in characters, of the span in the document
left context
the context to the left of the start index
text
the text in between the start and end indices
label
the label on the span in the document. If the annotation contains attributes and values, these will be represented in the label.
right context
the context to the right of the end index

It's also possible to omit the left and right contexts, if you prefer. The text file contains the same columns, except that file, start, and end are collapsed into a single location column.
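
To make the column layout above concrete, here is a minimal Python sketch of how one concordance entry could be assembled from a document's text. It is an illustration only, not the reporter's actual code: the function name concordance_row is hypothetical, and the sketch assumes that the end index is exclusive and that the context window is measured in characters on each side.

```python
def concordance_row(doc_name, text, start, end, label, window=32):
    """Build one concordance entry for the span text[start:end].

    Assumption: 'end' is exclusive, as in Python slicing.
    """
    return {
        "file": doc_name,
        "start": start,
        "end": end,
        # Clamp at 0 so a span near the beginning does not wrap around.
        "left context": text[max(0, start - window):start],
        "text": text[start:end],
        "label": label,
        # Slicing past the end of the string is safe in Python.
        "right context": text[end:end + window],
    }


row = concordance_row("doc1.txt", "The quick brown fox jumps.", 4, 9, "ADJ", window=4)
print(repr(row["left context"]), repr(row["text"]), repr(row["right context"]))
```

Omitting the contexts corresponds to dropping the "left context" and "right context" entries from such a row.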

This tool also allows you, via the --partition_by_label option, to generate CSV and text files for each content annotation label in the document set. In these versions, the annotation ID is reported in a column after the "end" column, and instead of the "label" column, the file contains a column for each known attribute of the annotation type.

It's also possible to interpolate document-level statistics such as file length and number of annotations per label into these reports.
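
As a rough picture of what such document-level statistics contain, the following Python sketch computes the length of a document in characters and the number of annotations per label. The function name file_statistics and the (start, end, label) tuple layout are assumptions made for this illustration; the reporter's own data structures are not shown here.

```python
from collections import Counter


def file_statistics(doc_name, text, annotations):
    """Summarize one document.

    annotations: iterable of (start, end, label) tuples (hypothetical layout).
    """
    counts = Counter(label for _start, _end, label in annotations)
    return {
        "file": doc_name,
        "length": len(text),      # file length in characters
        "counts": dict(counts),   # number of annotations per label
    }


stats = file_statistics(
    "doc1.txt",
    "Alice met Bob in Paris.",
    [(0, 5, "PERSON"), (10, 13, "PERSON"), (17, 22, "LOCATION")],
)
print(stats)
```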

Because the CSV files contain language data, please consult this special note on how to view them.
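
If you process the CSV file in a program rather than in a spreadsheet application, the main points are to decode it as UTF-8 and to let the CSV parser handle newlines embedded in text fields. The sketch below uses Python's standard csv module. The header names in the sample data are an assumption based on the column list above, and read_report is a hypothetical helper, not part of the tool.

```python
import csv
import os
import tempfile


def read_report(path):
    """Read a report CSV into a list of dicts keyed by the header row."""
    # newline="" lets the csv module handle newlines inside quoted fields.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Stand-in for a generated report; the text field contains a non-ASCII
# character and an embedded newline.
sample = 'file,start,end,text,label\r\ndoc1.txt,0,10,"Zürich\nHbf",LOCATION\r\n'

with tempfile.TemporaryDirectory() as tmp:
    report_path = os.path.join(tmp, "report.csv")
    with open(report_path, "w", encoding="utf-8", newline="") as handle:
        handle.write(sample)
    rows = read_report(report_path)

print(rows[0]["label"], repr(rows[0]["text"]))
```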

Usage

Unix:

% $MAT_PKG_HOME/bin/MATReport

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd

Usage: MATReport [options]

Core options

--task <task>
Name of the task to use. Obligatory if neither --content_annotations nor --content_annotations_all is provided and more than one task is registered.
--content_annotations ann,ann,ann...
Optional. If --task is not provided, the reporter requires additional, external information to determine which annotations are content annotations. Use this flag to provide a comma-separated sequence of annotation labels which should be treated as content annotations.
--content_annotations_all
Optional. If neither --task nor --content_annotations is provided, this flag will cause all labels in the document to be treated as content annotations.
--content_annotation_sets set,set,...
Optional. If the annotations to report on are inferred from a task (either because --task is specified or because neither --content_annotations nor --content_annotations_all is specified and there's a single task), the annotations which are available by default are those whose labels are in a non-administrative annotation set (i.e., a set which is not in the admin, zone, or token categories). If you want to replace these default annotation sets, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix).
--exclude_annotation_sets set,set,...
Optional. If the annotations to report on are inferred from a task (either because --task is specified or because neither --content_annotations nor --content_annotations_all is specified and there's a single task), the annotations which are available by default are those whose labels are in a non-administrative annotation set (i.e., a set which is not in the admin, zone, or token categories). If you want to exclude some of these sets or categories, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix).
--include_annotation_sets set,set,...
Optional. If the annotations to report on are inferred from a task (either because --task is specified or because neither --content_annotations nor --content_annotations_all is specified and there's a single task), the annotations which are available by default are those whose labels are in a non-administrative annotation set (i.e., a set which is not in the admin, zone, or token categories). If you want to include other sets or categories, use this flag to provide a comma-separated sequence of set names (or category names using the 'category:' prefix).
--verbose
If present, the tool will provide detailed information on its progress.

Input options

--input_files <file>
A glob-style pattern describing full pathnames to be reported on. May be specified with --input_dir. Can be repeated.
--input_dir <dir>
A directory, all of whose files will be reported on. Can be repeated. May be specified with --input_files.
--file_type <t>
The file type of the document(s). Must be one of the available readers. Default is mat-json.
--encoding <e>
The encoding of the input. The default is the appropriate default for the file type.
--handle_non_bmp <v>
Instructions on how to handle Unicode characters outside the Basic Multilingual Plane. Overrides the default HANDLE_NON_BMP configuration variable. Value is one of 'warn', 'scrub_or_warn', 'fail', 'ignore'. See the Unicode issues discussion for details. Default is 'warn'.
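
For readers unsure which characters the --handle_non_bmp option is concerned with: characters outside the Basic Multilingual Plane are those with code points above U+FFFF. The Python sketch below only locates such characters in a string. It does not reproduce what the tool does under 'warn', 'scrub_or_warn', 'fail', or 'ignore', and the function name is hypothetical.

```python
def non_bmp_characters(text):
    """Return (index, character) pairs for every code point above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


# U+1F600 lies outside the Basic Multilingual Plane; the accented
# letter in the second string does not.
found = non_bmp_characters("ok \U0001F600 done")
clean = non_bmp_characters("plain ASCII and \u00e9")
print(found, clean)
```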

Output options

--output_dir <dir>
The output directory for the reports. Will be created if it doesn't exist. Required.
--csv
Generate a CSV file in the output directory, with concordance-style data: file, location, content, left and right context, annotation label. At least one of this option or --txt must be provided. The CSV file will be in UTF-8 encoding. See this special note on viewing CSV files containing natural language text.
--txt
Generate a text file in the output directory, with concordance-style data, sorted first by annotation label and then by content. At least one of this option or --csv must be provided. The output file will be in UTF-8 encoding.
--concordance_window <i>
Use the specified value as the window size on each side of the concordance. Default is 32.
--omit_concordance_context
Omit the left and right concordance context from the output.
--file_csv
Generate a separate CSV file consisting of file-level statistics such as file size in characters and number of annotations of each type.
--interpolate_file_info
Instead of a separate CSV file for the file-level statistics, interpolate them into the concordance.
--include_spanless
By default, only spanned content annotations are produced. If this flag is present, spanless annotations (without position or left or right context, of course) will be included. If the spanless annotations refer to spanned annotations, the text context of the referenced annotations will be inserted in the 'text' column.
--partition_by_label
If present, in addition to the standard output file report.csv and/or report.txt, the tool will generate a separate spreadsheet for each label, with a column for each attribute.
--show_spanned_text_for_annotation_attribute_values
If present, show the text for annotation attribute values at the first level, if possible.
--normalize_whitespace
If present, normalize the whitespace in the text spans by replacing all whitespace sequences with a single space. This option might be useful if you want to view the report in Excel and you don't want the newlines to mess up the import wizard.
--create_task <name>
If present, in addition to the standard output file, the tool will generate a task subdirectory which contains a task.xml file appropriate for viewing your annotations, which you can install using MATManagePluginDirs. This is useful when you have annotated files which you did not annotate with MAT.
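
The effect described for --normalize_whitespace above can be approximated on an already generated report with a one-line regular expression. This is a sketch of the stated behavior (every whitespace sequence becomes a single space), not the tool's own code, and it makes no assumption about trimming: leading and trailing whitespace is collapsed to one space rather than removed.

```python
import re


def normalize_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


normalized = normalize_whitespace("New\r\n  York\tCity")
print(repr(normalized))
```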

Other options

The readers referenced in the --file_type option may introduce additional options, which are described here. These additional options must follow the --file_type option.

Examples

Example 1

Let's say you have a file, /path/to/file, whose annotations you want to view in a spreadsheet. You want the results to be written to /path/to/output.

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --csv --output_dir /path/to/output

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --csv --output_dir c:\path\to\output

Example 2

Let's say that you only want textual output, and you don't want the concordance columns:

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files /path/to/file --txt \
--output_dir /path/to/output --omit_concordance_context


Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files c:\path\to\file --txt ^
--output_dir c:\path\to\output --omit_concordance_context

Example 3

Let's say you have a directory full of files. /path/to/files contains files of the form file<n>.json. You want to view them both in CSV and in text, and you want a smaller concordance window of 10 characters.

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files '/path/to/files/*.json' \
--csv --txt --output_dir /path/to/output --concordance_window 10

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files "c:\path\to\files\*.json" ^
--csv --txt --output_dir c:\path\to\output --concordance_window 10
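
In this example the pattern is quoted so that it reaches the tool unexpanded, since --input_files takes a glob-style pattern rather than a list of files. The Python sketch below shows one plausible way such patterns, together with --input_dir directories, could be resolved into a list of files. The function collect_inputs and its de-duplication behavior are assumptions for illustration, not the tool's documented logic.

```python
import glob
import os
import tempfile


def collect_inputs(patterns=(), directories=()):
    """Resolve glob patterns and directories into an ordered list of file paths."""
    paths = []
    for pattern in patterns:
        paths.extend(sorted(glob.glob(pattern)))
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            full = os.path.join(directory, name)
            if os.path.isfile(full):
                paths.append(full)
    # Assumption: a file matched twice is reported once (first position kept).
    return list(dict.fromkeys(paths))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("file1.json", "file2.json", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("{}")
    by_pattern = [os.path.basename(p)
                  for p in collect_inputs(patterns=[os.path.join(tmp, "*.json")])]
    by_both = [os.path.basename(p)
               for p in collect_inputs(patterns=[os.path.join(tmp, "*.json")],
                                       directories=[tmp])]

print(by_pattern)
print(by_both)
```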

Example 4

Let's say you have a directory full of XML inline files which you did not annotate with MAT. You want to view the report in CSV, but you also want to prepare a task.xml file which you can install and then use to view these documents.

Unix:

% $MAT_PKG_HOME/bin/MATReport --input_files '/path/to/files/*.xml' \
--csv --output_dir /path/to/output --omit_concordance_context --include_spanless \
--create_task "New task"

Windows native:

> %MAT_PKG_HOME%\bin\MATReport.cmd --input_files "c:\path\to\files\*.xml" ^
--csv --output_dir c:\path\to\output ^
--omit_concordance_context --include_spanless ^
--create_task "New task"

Once this operation completes, the directory "task" in the output directory can be installed using MATManagePluginDirs.