The processing engine manages the execution of a sequence of
steps against a set of files.
Note that you should never use MATEngine
to save files into a workspace; use MATWorkspaceEngine instead.
Unix:
% $MAT_PKG_HOME/bin/MATEngine
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd
Usage: MATEngine [core options] [input/output/task options] [other options]
Options:
-h, --help show this help message and exit
Core options:
--other_app_dir=dir
additional directory to load a task from. Optional and repeatable.
--settings_file=file
a file of settings to use which overwrites existing settings. The file should be a
Python config file in the style of the template in etc/MAT_settings.config.in.
Optional.
--task=task name of the task to use. Obligatory if the system knows of more than one task.
Known tasks are: ...
--debug Enable debug output.
...
If no arguments are provided to MATEngine, the help message above
is presented. The complete list of options is presented once a
task argument is provided. Note that the core options must precede
the input, output and task options, which must precede any other
options (this is because the later options are added progressively
as the earlier options are discovered).
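The ordering constraint can be made concrete with a small Python sketch. The helper `build_mat_engine_argv` is hypothetical and not part of MAT; it only assembles an argument list in the documented group order and does not run anything. The option names and values are borrowed from the examples on this page.

```python
import shlex

def build_mat_engine_argv(core, io_task, other, exe="MATEngine"):
    """Assemble an argument list in the documented group order.

    Each group is a list of (flag, value) pairs; value is None for
    flags that take no argument. This helper is illustrative only.
    """
    argv = [exe]
    # Core options first, then input/output/task options, then the rest.
    for group in (core, io_task, other):
        for flag, value in group:
            argv.append(flag)
            if value is not None:
                argv.append(value)
    return argv

argv = build_mat_engine_argv(
    core=[("--task", "My Task")],
    io_task=[
        ("--workflow", "All"),
        ("--steps", "zone,tokenize,tag"),
        ("--input_file", "/path/to/my/document.txt"),
        ("--input_file_type", "raw"),
    ],
    other=[("--carafe_tag_local", None)],
)

# Show the command line with shell quoting applied.
print(shlex.join(argv))
```

Keeping the three groups as separate arguments makes it hard to emit a step-contributed option before the options that cause it to be registered.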
The MAT engine is embedded in a number of other locations, such
as the specification of
workspace operations and the preprocessing and test corpus
processing in the experiment engine
specifications. Both of these specifications use XML as
their configuration language. Accordingly, we describe both the
command line options and their XML equivalents here (with the
exception of the core options immediately below, which don't have
any XML equivalents).
Command line option | Description |
---|---|
--other_app_dir <dir> | If present, a directory to look in to find a MAT task specification. This directory must contain a task.xml file which describes the task. This is only necessary if 'MATManagePluginDirs install' has not been called on the task directory. |
--task <s> | The name of a task, as specified in some task.xml file. Required if the system knows of more than one task. The known tasks are listed in the help message. |
--settings_file <f> | A file of settings to use which overwrites existing settings. The file should be a Python config file in the style of the template in etc/MAT_settings.config.in. Optional. |
MATEngine also makes the common
options available.
Once a task argument is present, MATEngine summarizes the
workflow structure for the task before it prints out the full
option list:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task 'Named Entity'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "Named Entity"
Error: workflow must be specified
Usage: MATEngine [core options] [input/output/task options] [other options]
Named Entity :
available workflows:
Hand annotation : steps zone, tokenize, tag
Review/repair : steps
Demo : steps zone, tokenize, tag
...
The remainder of the options can be grouped into a number of
categories.
The task options control what is done to each input file. A
workflow must be specified. You can either apply new steps (with
the --steps flag), or undo existing steps (with the --undo_through
flag). If neither is specified, the tool operates as a (somewhat
expensive) transducer between the input and output formats.
Command line option | XML attribute | Value | Description |
---|---|---|---|
--workflow <s> | workflow | The name of a workflow, as specified in some task.xml file | Required. The known workflows for a given task are specified in the "available workflows" subsections in the listing of available tasks printed after the usage string. |
--language <l> | language | A language name or code | Language to use, either by name or code, as specified in the task. Obligatory if multiple languages are present and the application of one of the steps varies by language. |
--steps <s> | steps | A comma-concatenated sequence of workflow steps | Some ordered subset of the steps in the specified workflow. The steps for a given workflow are specified in the "available workflows" subsections in the listing of available tasks printed after the usage string. If no steps are specified, none will be applied. |
--undo_through <s> | undo_through | A step in the current workflow | This step, and every step already done in the document which follows it, is undone before any of the steps in --steps are applied. You can use this flag in conjunction with --steps to rewind and then reapply operations. |
--print_steps <s> | <not available> | A comma-concatenated sequence of workflow steps | Some subset of the steps in the specified workflow. Verbose details about these steps will be printed as they're performed (so the step subset should also be a subset of the steps listed in --steps). Only available in the unembedded MATEngine. |
--mark_gold <s> | mark_gold | A comma-concatenated sequence of steps | Some subset of the steps in the specified workflow. If the named steps add managed sets, those sets will be marked gold. The --mark_reconciled option takes precedence over this option. |
--mark_reconciled <s> | mark_reconciled | A comma-concatenated sequence of steps | Some subset of the steps in the specified workflow. If the named steps add managed sets, those sets will be marked reconciled. This option takes precedence over the --mark_gold option. |
--fresh_task | fresh_task | | If this option is present, all task information in each document will be removed and re-inferred before the engine is applied. Use this option if you're processing documents which were created using a task other than the current one. |
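The interaction between --undo_through and --steps can be illustrated with a toy Python model. This is a sketch of the behavior described in the table, not MAT's implementation: the function name is hypothetical, a document is reduced to the list of steps already done, and the assumption that an already-done step is simply skipped is ours.

```python
def apply_undo_and_steps(workflow, done, undo_through=None, steps=()):
    """Toy model: return the steps recorded as done after one run.

    workflow     -- ordered list of step names in the workflow
    done         -- steps already done in the document
    undo_through -- undo this step and every done step after it
    steps        -- ordered subset of the workflow to apply afterwards
    """
    # Keep the done steps in workflow order.
    done = [s for s in workflow if s in done]
    if undo_through is not None:
        # The undo happens first: drop the named step and all later ones.
        cutoff = workflow.index(undo_through)
        done = [s for s in done if workflow.index(s) < cutoff]
    for step in steps:
        if step not in workflow:
            raise ValueError("unknown step: %s" % step)
        # Assumption: a step that is already done is not applied again.
        if step not in done:
            done.append(step)
    return done

WORKFLOW = ["zone", "tokenize", "tag"]

# Undo tokenization only.
print(apply_undo_and_steps(WORKFLOW, ["zone", "tokenize"],
                           undo_through="tokenize"))
# Rewind and reapply tokenization in a single run.
print(apply_undo_and_steps(WORKFLOW, ["zone", "tokenize"],
                           undo_through="tokenize", steps=["tokenize"]))
```

With neither option given, the model returns the document unchanged, which corresponds to using the engine as a format transducer.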
The input options specify the input files. You can specify
individual files, or directories (possibly filtering their
contents using a regular expression). You must specify a file
type. For raw files, you can also specify an input character
encoding.
Command line option | XML attribute | Value | Description |
---|---|---|---|
--input_file <f> | | | The file to process. Either this or --input_dir must be specified. A single dash ('-') will cause the engine to read from standard input. |
--input_dir <d> | | | The directory to process. Either this or --input_file must be specified. |
--input_file_re <s> | | | If --input_dir is specified, a regular expression to match the filenames in the directory against. The pattern must cover the entire filename (and only the filename, not the full path). |
--input_encoding <e> | | | Input character encoding for raw files. Default is UTF-8. |
--input_file_type <t> | | | The file type of the input. One of the available readers and writers. Required. |
--handle_non_bmp <v> | | one of 'warn', 'scrub_or_warn', 'fail', 'ignore' | Instructions on how to handle Unicode characters outside the Basic Multilingual Plane. Overrides the default HANDLE_NON_BMP configuration variable. See the Unicode issues discussion for details. Default is 'warn'. |
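To check in advance whether a raw document contains the characters that --handle_non_bmp is concerned with, a short Python 3 sketch like the following may help. It only detects characters outside the Basic Multilingual Plane (code points above U+FFFF); the function name is hypothetical, and the sketch does not reproduce what MAT does for any of the four settings.

```python
def non_bmp_characters(text):
    """Return (index, character) pairs for code points above U+FFFF."""
    # Python 3 strings are indexed by code point, so each non-BMP
    # character is a single element here.
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

sample = "caf\u00e9 \U0001F600 ok"
for index, ch in non_bmp_characters(sample):
    print("non-BMP character U+%06X at index %d" % (ord(ch), index))
```

Note that the accented letter in the sample is inside the BMP and is not reported; only the emoji is.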
The output options specify how the result is saved. If you don't
specify any output options, the result will be ignored. You can
specify an output file for an input file, or an output directory
and/or name mapping for an input directory. You must also specify
the output format; usually, you'll want this to be one of the rich
formats, but "raw" is useful in some rare circumstances. Finally,
you can specify an output character encoding for raw files.
Command line option | XML attribute | Value | Description |
---|---|---|---|
--output_file <f> | | | Where to save the output. Optional. Must be paired with --input_file. A single dash ('-') will cause the engine to write to standard output. |
--output_dir <d> | | | Where to save the output. Optional. Must be paired with --input_dir. |
--output_fsuff <s> | | | The suffix to add to each filename when --output_dir is specified. If absent, the name of each file will be identical to the name of the file in the input directory. |
--output_file_type <t> | | | The type of the file to save. One of the available readers and writers. Required if either --output_file or --output_dir is specified. |
--output_encoding <e> | | | Output character encoding for raw files. Default is UTF-8. |
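The name mapping for directory processing can be sketched in a few lines of Python. This is an illustration of the rule stated for --output_dir and --output_fsuff, not MAT code; the function name is hypothetical, and `posixpath` is used so that the result is the same on every platform.

```python
import posixpath

def output_path(input_path, output_dir, output_fsuff=""):
    """Map an input file to its output location.

    The output keeps the input's filename, with the optional suffix
    appended to the whole name (extension included).
    """
    name = posixpath.basename(input_path)
    return posixpath.join(output_dir, name + output_fsuff)

# Same names, different directory.
print(output_path("/path/to/my/documents/a.txt", "/path/to/my/jsondocuments"))
# Same directory, extra suffix.
print(output_path("/path/to/my/documents/a.txt", "/path/to/my/documents", ".json"))
```

Because the suffix is appended rather than substituted, a raw file and its processed counterpart can share a directory without colliding.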
The readers and writers described above may introduce additional
options, which are described here.
These options must follow the input and output options.
Finally, it's possible for the
implementation of individual steps to contribute
command-line arguments to MATEngine. There may be default values
for these arguments defined in your task, and these command-line
specifications override those defaults.
The general options for automated tagging are:
Command line option | XML attribute | Value | Description |
---|---|---|---|
--<step_name>_local | <step_name>_local | "yes" (XML) | Don't try to contact a remote tagger server; rather, start up a local command. The <step_name> is the true (not pretty) name of the tag step, with all non-alphanumeric sequences replaced with an underscore. |
--<step_name>_model <f> | <step_name>_model | string | Provide a tagger model file. Obligatory if no model is specified in the settings for the relevant task step and no default model is present in the task for that step. The <step_name> is the true (not pretty) name of the tag step, with all non-alphanumeric sequences replaced with an underscore. |
In our sample task,
<step_name> is carafe_tag.
In addition, the jCarafe tagger
provides other tagging options.
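The rule for deriving <step_name> can be approximated in Python as follows. This is a sketch under the assumption that "alphanumeric" means ASCII letters and digits; both function names are hypothetical, and MAT's own normalization may differ in detail.

```python
import re

def step_option_prefix(step_name):
    """Replace each run of non-alphanumeric characters with one underscore."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.sub(r"[^A-Za-z0-9]+", "_", step_name)

def tagging_flags(step_name):
    """Build the two general tagging option names for a tag step."""
    prefix = step_option_prefix(step_name)
    return ["--%s_local" % prefix, "--%s_model" % prefix]

print(tagging_flags("carafe_tag"))
print(step_option_prefix("my tagger-v2"))
```

A step name that is already made of letters, digits and single underscores, such as carafe_tag, comes through unchanged.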
Let's say you have a task named "My Task", with a workflow named
"All" which contains steps "zone", "tokenize" and "tag" as in the
sample 'Named Entity' task. In
order to zone and tokenize a raw document /path/to/my/document.txt
and save the result to a MAT JSON document:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw --output_file c:\path\to\my\document.txt.json \
--output_file_type mat-json
Let's say you want to undo the tokenize step from the document
above:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--undo_through 'tokenize' --input_file /path/to/my/document.txt.json \
--input_file_type mat-json --output_file /path/to/my/document.txt.notoks.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--undo_through "tokenize" --input_file c:\path\to\my\document.txt.json \
--input_file_type mat-json --output_file c:\path\to\my\document.txt.notoks.json \
--output_file_type mat-json
Let's say you want to process the document as in example 1, but
you don't have any interest in saving the results (e.g., you're
just testing to see if it breaks):
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw
Let's say you want to process the document as in example 1, and
you want to see the result, but you don't want to bother saving it
to a file:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file - --output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw --output_file - --output_file_type mat-json
Let's say you want to process the document as in example 1, but
you want to see the intermediate results:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json --print_steps 'zone,tokenize'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt \
--input_file_type raw --output_file c:\path\to\my\document.txt.json \
--output_file_type mat-json --print_steps "zone,tokenize"
Let's say you have the output of example 1, but you want to
retokenize it. You can simultaneously specify the undo and redo
steps. The undo steps will be performed first.
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--undo_through 'tokenize' --steps tokenize \
--input_file /path/to/my/document.txt.json \
--input_file_type mat-json --output_file /path/to/my/document.txt.retoks.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--undo_through "tokenize" --steps tokenize \
--input_file c:\path\to\my\document.txt.json \
--input_file_type mat-json --output_file c:\path\to\my\document.txt.retoks.json \
--output_file_type mat-json
Let's say you have a directory full of text files in
/path/to/my/documents which you want to process, and you want the
results to have the identical names, but in
/path/to/my/jsondocuments:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/jsondocuments \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_dir c:\path\to\my\documents \
--input_file_type raw --output_dir c:\path\to\my\jsondocuments \
--output_file_type mat-json
Let's say you want to process your documents as in example 7, but
you want to save them back to /path/to/my/documents, with an
additional suffix:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/documents \
--output_file_type mat-json --output_fsuff '.json'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_dir c:\path\to\my\documents \
--input_file_type raw --output_dir c:\path\to\my\documents \
--output_file_type mat-json --output_fsuff ".json"
Let's say you have a directory like the one that would be created
in example 8, with raw and MAT JSON documents intermixed. But all
the files you want to process end with ".txt":
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/documents \
--output_file_type mat-json --output_fsuff '.json' \
--input_file_re '.*[.]txt'
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize" --input_dir c:\path\to\my\documents \
--input_file_type raw --output_dir c:\path\to\my\documents \
--output_file_type mat-json --output_fsuff ".json" \
--input_file_re ".*[.]txt"
Note that the regular expression is a Python regular expression,
and that it must be enclosed in single quotes on the Unix command
line to suppress any bash command-line preprocessing (double
quotes should be used in Windows, as usual).
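Because the pattern must cover the entire filename, its effect can be previewed with Python's `re.fullmatch`. The sketch below is illustrative only: the function name and the file list are made up, and MAT may implement the whole-name requirement differently internally.

```python
import re

def matching_files(filenames, pattern):
    """Return the filenames that the pattern matches in full."""
    compiled = re.compile(pattern)
    # fullmatch succeeds only if the pattern consumes the whole name.
    return [name for name in filenames if compiled.fullmatch(name)]

names = ["a.txt", "a.txt.json", "b.txt", "notes.md"]
print(matching_files(names, r".*[.]txt"))

# For contrast, an unanchored search would also accept "a.txt.json".
print([name for name in names if re.search(r".*[.]txt", name)])
```

The contrast with `re.search` shows why the already-processed .json files are left alone when the pattern is required to match the full name.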
Let's say your "carafe_tag" step in the "All" workflow is
implemented as a jCarafe tag step. By default, MATEngine attempts to contact
the MATWeb server for tagging (because the MATWeb server
doubles as a proxy for a tagging service, which doesn't need to be
started up every time you want to tag a document). You can provide
a jCarafe model, and ensure that the engine starts up jCarafe
itself rather than trying to contact MATWeb, as follows:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize,tag' --input_file /path/to/my/document.txt \
--carafe_tag_local --carafe_tag_model /path/to/my/model \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "zone,tokenize,tag" --input_file c:\path\to\my\document.txt \
--carafe_tag_local --carafe_tag_model c:\path\to\my\model \
--input_file_type raw --output_file c:\path\to\my\document.txt.json \
--output_file_type mat-json
Note that this model overrides any model specified in the task
file.
This is like example 10, except that it operates on the output of
example 1 (that is, zoning and tokenization are already done):
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'tag' --input_file /path/to/my/document.txt.json \
--carafe_tag_local --carafe_tag_model /path/to/my/model \
--input_file_type mat-json --output_file /path/to/my/document.txt.tagged.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All \
--steps "tag" --input_file c:\path\to\my\document.txt.json \
--carafe_tag_local --carafe_tag_model c:\path\to\my\model \
--input_file_type mat-json --output_file c:\path\to\my\document.txt.tagged.json \
--output_file_type mat-json
Let's say that you have some XML documents which contain XML
content annotations, and you have an "align" step which will align
the annotation boundaries with token boundaries after you've
tokenized the document. Furthermore, you want all the tags which
aren't names of annotations in your task to be preserved in the
signal, so you provide the --xml_input_is_overlay option, which is
an option enabled by the xml-inline
reader:
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--steps 'zone,tokenize,align' --input_file /path/to/my/document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file /path/to/my/document.xml.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import \
--steps "zone,tokenize,align" --input_file c:\path\to\my\document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file c:\path\to\my\document.xml.json \
--output_file_type mat-json
Let's say that you want to do example 12 in two steps: first
convert to MAT JSON format, then process. To do the conversion,
simply call MATEngine without any steps (or use MATTransducer):
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--input_file /path/to/my/document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file /path/to/my/document_unprocessed.xml.json \
--output_file_type mat-json
% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--steps 'zone,tokenize,align' \
--input_file /path/to/my/document_unprocessed.xml.json \
--input_file_type mat-json \
--output_file /path/to/my/document.xml.json \
--output_file_type mat-json
Windows native:
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import \
--input_file c:\path\to\my\document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file c:\path\to\my\document_unprocessed.xml.json \
--output_file_type mat-json
> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import \
--steps "zone,tokenize,align" \
--input_file c:\path\to\my\document_unprocessed.xml.json \
--input_file_type mat-json \
--output_file c:\path\to\my\document.xml.json \
--output_file_type mat-json