Processing engine

Description

The processing engine manages the execution of a sequence of steps against a set of files.

Note that you should never use MATEngine to save files into a workspace; use MATWorkspaceEngine instead.

Usage

Unix:

% $MAT_PKG_HOME/bin/MATEngine

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd

Usage: MATEngine [core options] [input/output/task options] [other options]

Options:
  -h, --help            show this help message and exit

  Core options:
    --other_app_dir=dir
                        additional directory to load a task from. Optional and repeatable.
    --settings_file=file
                        a file of settings to use which overwrites existing settings. The file should be a
                        Python config file in the style of the template in etc/MAT_settings.config.in.
                        Optional.
    --task=task         name of the task to use. Obligatory if the system knows of more than one task.
                        Known tasks are: ...
    --debug             Enable debug output.
...

If no arguments are provided to MATEngine, the help message above is presented. The complete list of options is presented once a task argument is provided. Note that the core options must precede the input, output and task options, which must precede any other options (this is because the later options are added progressively as the earlier options are discovered).
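The ordering constraint can be made concrete with a small Python sketch that assembles an argument list in the required order. The helper build_matengine_argv and its three grouping parameters are hypothetical, introduced only for illustration; only the option names are taken from this page, and the sketch builds the list without running MATEngine.

```python
def build_matengine_argv(core, io_and_task, other=()):
    """Assemble a MATEngine argument list in the order the engine expects.

    Hypothetical helper: each group is a sequence of (flag, value) pairs,
    and a value of None marks a flag that takes no argument.
    """
    argv = ["MATEngine"]
    # Core options first, then input/output/task options, then the rest.
    for group in (core, io_and_task, other):
        for flag, value in group:
            argv.append(flag)
            if value is not None:
                argv.append(value)
    return argv


argv = build_matengine_argv(
    core=[("--task", "Named Entity")],
    io_and_task=[
        ("--workflow", "Demo"),
        ("--steps", "zone,tokenize,tag"),
        ("--input_file", "/path/to/my/document.txt"),
        ("--input_file_type", "raw"),
    ],
    other=[("--carafe_tag_local", None)],
)
print(" ".join(argv))
```

Keeping the three groups separate in the caller makes it hard to put a step-contributed option ahead of the --task option that makes it available.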

The MAT engine is embedded in a number of other locations, such as the specification of workspace operations and the preprocessing and test corpus processing in the experiment engine specifications. Both of these specifications use XML as their configuration language. Accordingly, we describe both the command line options and their XML equivalents here (with the exception of the core options immediately below, which don't have any XML equivalents).

Core options

--other_app_dir <dir>	If present, a directory to look in to find a MAT task specification. This directory must contain a task.xml file which describes the task. This is only necessary if 'MATManagePluginDirs install' has not been called on the task directory.
--task <s>	The name of a task, as specified in some task.xml file. Required if the system knows of more than one task. The known tasks are reported in the help message.
--settings_file <f>	A file of settings to use which overwrites existing settings. The file should be a Python config file in the style of the template in etc/MAT_settings.config.in. Optional.

MATEngine also makes the common options available.

Once a task argument is present, MATEngine summarizes the workflow structure for the task before it prints out the full option list:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task 'Named Entity'

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "Named Entity"

Error: workflow must be specified
Usage: MATEngine [core options] [input/output/task options] [other options]

Named Entity :
  available workflows:
    Hand annotation : steps zone, tokenize, tag
    Review/repair : steps 
    Demo : steps zone, tokenize, tag

...

The remainder of the options can be grouped into a number of categories.

Task options

The task options control what is done to each input file. A workflow must be specified. You can either apply new steps (with the --steps flag), or undo existing steps (with the --undo_through flag). If neither is specified, the tool operates as a (somewhat expensive) transducer between the input and output formats.

Command line option	XML attribute	Value	Description
--workflow <s>	workflow	The name of a workflow, as specified in some task.xml file	Required. The known workflows for a given task are specified in the "available workflows" subsections in the listing of available tasks printed after the usage string.
--language <l>	language	A language name or code	Language to use, either by name or code, as specified in the task. Obligatory if multiple languages are present and the application of one of the steps varies by language.
--steps <s>	steps	A comma-concatenated sequence of workflow steps	Some ordered subset of the steps in the specified workflow. The steps for a given workflow are specified in the "available workflows" subsections in the listing of available tasks printed after the usage string. If no steps are specified, none will be applied.
--undo_through <s>	undo_through	A step in the current workflow	This step, and every later step in the workflow that has already been done in the document, is undone before any of the steps in --steps are applied. You can use this flag in conjunction with --steps to rewind and then reapply operations.
--print_steps <s>	<not available>	A comma-concatenated sequence of workflow steps	Some subset of the steps in the specified workflow. Verbose details about these steps will be printed as they're performed, so the subset should also be a subset of the steps listed in --steps. Only available in the unembedded MATEngine.
--mark_gold <s>	mark_gold	A comma-concatenated sequence of steps	Some subset of the steps in the specified workflow. If the named steps add managed sets, those sets will be marked gold. The --mark_reconciled option takes precedence over this option.
--mark_reconciled <s>	mark_reconciled	A comma-concatenated sequence of steps	Some subset of the steps in the specified workflow. If the named steps add managed sets, those sets will be marked reconciled. This option takes precedence over the --mark_gold option.
--fresh_task	fresh_task		If this option is present, all task information in each document will be removed and re-inferred before the engine is applied. Use this option if you're processing documents which were created using a task other than the current one.
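The interaction between --undo_through and --steps can be modeled in a few lines of Python. This is only a sketch of the behavior described in the table, not MATEngine code; the function apply_undo_and_steps and its arguments are hypothetical names.

```python
def apply_undo_and_steps(workflow_steps, done, undo_through=None, steps=()):
    """Return the steps recorded as done after an undo followed by new steps.

    Hypothetical model: workflow_steps is the ordered step list of the
    workflow, done is the set of steps already applied to the document.
    """
    # Keep the already-done steps in workflow order.
    done = [s for s in workflow_steps if s in done]
    if undo_through is not None:
        cutoff = workflow_steps.index(undo_through)
        # Undo the named step and every step after it in the workflow.
        done = [s for s in done if workflow_steps.index(s) < cutoff]
    # The undo always happens before any new steps are applied.
    for step in steps:
        if step not in done:
            done.append(step)
    return done


workflow = ["zone", "tokenize", "tag"]
undone = apply_undo_and_steps(workflow, ["zone", "tokenize"],
                              undo_through="tokenize")
redone = apply_undo_and_steps(workflow, ["zone", "tokenize"],
                              undo_through="tokenize", steps=["tokenize"])
print(undone)   # ['zone']
print(redone)   # ['zone', 'tokenize']
```

The second call corresponds to rewinding and reapplying a step in a single invocation.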

Input options

The input options specify the input files. You can specify individual files, or directories (possibly filtering their contents using a regular expression). You must specify a file type. For raw files, you can also specify an input character encoding.

Command line option	Value	Description
--input_file <f>		The file to process. Either this or --input_dir must be specified. A single dash ('-') will cause the engine to read from standard input.
--input_dir <d>		The directory to process. Either this or --input_file must be specified.
--input_file_re <s>		If --input_dir is specified, a regular expression to match the filenames in the directory against. The pattern must cover the entire filename (and only the filename, not the full path).
--input_encoding <e>		Input character encoding for raw files. Default is UTF-8.
--input_file_type <t>		The file type of the input. One of the available readers and writers. Required.
--handle_non_bmp <v>	one of 'warn', 'scrub_or_warn', 'fail', 'ignore'	Instructions on how to handle Unicode characters outside the Basic Multilingual Plane. Overrides the default HANDLE_NON_BMP configuration variable. See the Unicode issues discussion for details. Default is 'warn'.
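To see which characters --handle_non_bmp is concerned with, the following Python 3 sketch lists the characters in a string whose code points lie above U+FFFF, the upper bound of the Basic Multilingual Plane. It only identifies such characters; how MATEngine warns about, scrubs, or rejects them is not modeled here, and the function name non_bmp_characters is hypothetical.

```python
def non_bmp_characters(text):
    """Return (index, character) pairs for code points above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


# U+00E9 is inside the BMP; U+1F600 is outside it.
sample = "caf\u00e9 \U0001F600 ok"
found = non_bmp_characters(sample)
print(found)
```

A check like this can be run over raw files before processing if you want to know in advance whether the setting will matter.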

Output options

The output options specify how the result is saved. If you don't specify any output options, the result will be ignored. You can specify an output file for an input file, or an output directory and/or name mapping for an input directory. You must also specify the output format; usually, you'll want this to be one of the rich formats, but "raw" is useful in some rare circumstances. Finally, you can specify an output character encoding for raw files.

Command line option	XML attribute	Value	Description
--output_file <f>			Where to save the output. Optional. Must be paired with --input_file. A single dash ('-') will cause the engine to write to standard output.
--output_dir <d>			Where to save the output. Optional. Must be paired with --input_dir.
--output_fsuff <s>			The suffix to add to each filename when --output_dir is specified. If absent, the name of each file will be identical to the name of the file in the input directory.
--output_file_type <t>			The type of the file to save. One of the available readers and writers. Required if either --output_file or --output_dir is specified.
--output_encoding <e>			Output character encoding for raw files. Default is UTF-8.
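The naming rule for directory output can be sketched in Python as follows. The helper output_path is hypothetical, and the sketch assumes, based on the --output_fsuff description above, that the output file keeps the input file's name with the optional suffix appended.

```python
import posixpath


def output_path(input_path, output_dir, output_fsuff=""):
    """Map one input file to its output path (hypothetical helper).

    Assumption: the output keeps the input's file name, with the
    optional suffix appended to it.
    """
    name = posixpath.basename(input_path)
    return posixpath.join(output_dir, name + output_fsuff)


same_name = output_path("/path/to/my/documents/a.txt",
                        "/path/to/my/jsondocuments")
with_suffix = output_path("/path/to/my/documents/a.txt",
                          "/path/to/my/documents", ".json")
print(same_name)     # /path/to/my/jsondocuments/a.txt
print(with_suffix)   # /path/to/my/documents/a.txt.json
```

Without a suffix, writing back into the input directory would produce names that collide with the inputs, which is why a suffix is useful when the input and output directories are the same.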

Other options

The readers and writers described above may introduce additional options, which are described here. These options must follow the input and output options.

Finally, it's possible for the implementation of individual steps to contribute command-line arguments to MATEngine. There may be default values for these arguments defined in your task, and these command-line specifications override those defaults.

The general options for automated tagging are:

Command line option	XML attribute	Value	Description
--<step_name>_local	<step_name>_local	"yes" (XML)	Don't try to contact a remote tagger server; rather, start up a local command. The <step_name> is the true (not pretty) name of the tag step, with all non-alphanumeric sequences replaced with an underscore.
--<step_name>_model <f>	<step_name>_model	string	Provide a tagger model file. Obligatory if no model is specified in the settings for the relevant task step and no default model is present in the task for that step. The <step_name> is the true (not pretty) name of the tag step, with all non-alphanumeric sequences replaced with an underscore.

In our sample task, <step_name> is carafe_tag.
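The rule for deriving the option prefix from a step name (every run of non-alphanumeric characters becomes a single underscore) can be expressed as a short Python sketch. The function option_prefix and the second step name are hypothetical, and whether MATEngine alters letter case is not stated on this page, so case is left unchanged here.

```python
import re


def option_prefix(step_name):
    """Replace each run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", step_name)


for name in ("carafe_tag", "my-tag step"):
    prefix = option_prefix(name)
    # The two general tagging options are built from the prefix.
    print("--%s_local" % prefix, "--%s_model" % prefix)
```

For the sample task's carafe_tag step, this yields the --carafe_tag_local and --carafe_tag_model options.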

In addition, the jCarafe tagger provides other tagging options.

Examples

Example 1

Let's say you have a task named "My Task", with a workflow named "All" which contains steps "zone", "tokenize" and "tag" as in the sample 'Named Entity' task. In order to zone and tokenize a raw document /path/to/my/document.txt and save the result to a MAT JSON document:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt ^
--input_file_type raw --output_file c:\path\to\my\document.txt.json ^
--output_file_type mat-json

Example 2

Let's say you want to undo the tokenize step from the document above:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--undo_through 'tokenize' --input_file /path/to/my/document.txt.json \
--input_file_type mat-json --output_file /path/to/my/document.txt.notoks.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--undo_through "tokenize" --input_file c:\path\to\my\document.txt.json ^
--input_file_type mat-json --output_file c:\path\to\my\document.txt.notoks.json ^
--output_file_type mat-json

Example 3

Let's say you want to process the document as in example 1, but you don't have any interest in saving the results (e.g., you're just testing to see if it breaks):

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt ^
--input_file_type raw

Example 4

Let's say you want to process the document as in example 1, and you want to see the result, but you don't want to bother saving it to a file:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file - --output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt ^
--input_file_type raw --output_file - --output_file_type mat-json

Example 5

Let's say you want to process the document as in example 1, but you want to see the intermediate results:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_file /path/to/my/document.txt \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json --print_steps 'zone,tokenize'

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_file c:\path\to\my\document.txt ^
--input_file_type raw --output_file c:\path\to\my\document.txt.json ^
--output_file_type mat-json --print_steps "zone,tokenize"

Example 6

Let's say you have the output of example 1, but you want to retokenize it. You can simultaneously specify the undo and redo steps. The undo steps will be performed first.

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--undo_through 'tokenize' --steps tokenize \
--input_file /path/to/my/document.txt.json \
--input_file_type mat-json --output_file /path/to/my/document.txt.retoks.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--undo_through "tokenize" --steps tokenize ^
--input_file c:\path\to\my\document.txt.json ^
--input_file_type mat-json --output_file c:\path\to\my\document.txt.retoks.json ^
--output_file_type mat-json

Example 7

Let's say you have a directory full of text files in /path/to/my/documents which you want to process, and you want the results to have the identical names, but in /path/to/my/jsondocuments:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/jsondocuments \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_dir c:\path\to\my\documents ^
--input_file_type raw --output_dir c:\path\to\my\jsondocuments ^
--output_file_type mat-json

Example 8

Let's say you want to process your documents as in example 7, but you want to save them back to /path/to/my/documents, with an additional suffix:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/documents \
--output_file_type mat-json --output_fsuff '.json'

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_dir c:\path\to\my\documents ^
--input_file_type raw --output_dir c:\path\to\my\documents ^
--output_file_type mat-json --output_fsuff ".json"

Example 9

Let's say you have a directory like the one that would be created in example 8, with raw and MAT JSON documents intermixed. But all the files you want to process end with ".txt":

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize' --input_dir /path/to/my/documents \
--input_file_type raw --output_dir /path/to/my/documents \
--output_file_type mat-json --output_fsuff '.json' \
--input_file_re '.*[.]txt'

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize" --input_dir c:\path\to\my\documents ^
--input_file_type raw --output_dir c:\path\to\my\documents ^
--output_file_type mat-json --output_fsuff ".json" ^
--input_file_re ".*[.]txt"

Note that the regular expression is a Python regular expression, and that it must be enclosed in single quotes on the Unix command line to suppress any bash command-line preprocessing (double quotes should be used in Windows, as usual).
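Because the pattern must cover the entire filename, a pattern such as '.*[.]txt' selects document.txt but not document.txt.json. The Python sketch below illustrates this with re.fullmatch; that MATEngine performs the match in exactly this way is an assumption, and the function select_files and the file names are hypothetical.

```python
import re


def select_files(filenames, input_file_re):
    """Keep only the names that the pattern matches in their entirety."""
    pattern = re.compile(input_file_re)
    return [name for name in filenames if pattern.fullmatch(name)]


names = ["a.txt", "a.txt.json", "b.txt", "notes.md"]
selected = select_files(names, r".*[.]txt")
print(selected)   # ['a.txt', 'b.txt']
```

A pattern that only matched a prefix of the name would also have picked up a.txt.json, which is exactly what the whole-filename requirement rules out.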

Example 10

Let's say your "carafe_tag" step in the "All" workflow is implemented as a jCarafe tag step. By default, MATEngine attempts to contact the MATWeb server for tagging (because the MATWeb server doubles as a proxy for a tagging service, which doesn't need to be started up every time you want to tag a document). You can provide a jCarafe model, and ensure that the engine starts up jCarafe itself rather than trying to contact MATWeb, as follows:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'zone,tokenize,tag' --input_file /path/to/my/document.txt \
--carafe_tag_local --carafe_tag_model /path/to/my/model \
--input_file_type raw --output_file /path/to/my/document.txt.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "zone,tokenize,tag" --input_file c:\path\to\my\document.txt ^
--carafe_tag_local --carafe_tag_model c:\path\to\my\model ^
--input_file_type raw --output_file c:\path\to\my\document.txt.json ^
--output_file_type mat-json

Note that this model overrides any model specified in the task file.

Example 11

Like example 10, except on the output of example 1 (that is, zoning and tokenization are already done):

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow All \
--steps 'tag' --input_file /path/to/my/document.txt.json \
--carafe_tag_local --carafe_tag_model /path/to/my/model \
--input_file_type mat-json --output_file /path/to/my/document.txt.tagged.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow All ^
--steps "tag" --input_file c:\path\to\my\document.txt.json ^
--carafe_tag_local --carafe_tag_model c:\path\to\my\model ^
--input_file_type mat-json --output_file c:\path\to\my\document.txt.tagged.json ^
--output_file_type mat-json

Example 12

Let's say that you have some XML documents which contain XML content annotations, and you have an "align" step which will align the annotation boundaries with token boundaries after you've tokenized the document. Furthermore, you want all the tags which aren't names of annotations in your task to be preserved in the signal, so you provide the --xml_input_is_overlay option, which is an option enabled by the xml-inline reader:

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--steps 'zone,tokenize,align' --input_file /path/to/my/document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file /path/to/my/document.xml.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import ^
--steps "zone,tokenize,align" --input_file c:\path\to\my\document.xml ^
--input_file_type xml-inline --xml_input_is_overlay ^
--output_file c:\path\to\my\document.xml.json ^
--output_file_type mat-json

Example 13

Let's say that you want to do example 12 in two steps: first convert to MAT JSON format, then process. To do the conversion, simply call MATEngine without any steps (or use MATTransducer):

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--input_file /path/to/my/document.xml \
--input_file_type xml-inline --xml_input_is_overlay \
--output_file /path/to/my/document_unprocessed.xml.json \
--output_file_type mat-json

% $MAT_PKG_HOME/bin/MATEngine --task "My Task" --workflow Import \
--steps 'zone,tokenize,align' \
--input_file /path/to/my/document_unprocessed.xml.json \
--input_file_type mat-json \
--output_file /path/to/my/document.xml.json \
--output_file_type mat-json

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import ^
--input_file c:\path\to\my\document.xml ^
--input_file_type xml-inline --xml_input_is_overlay ^
--output_file c:\path\to\my\document_unprocessed.xml.json ^
--output_file_type mat-json

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "My Task" --workflow Import ^
--steps "zone,tokenize,align" ^
--input_file c:\path\to\my\document_unprocessed.xml.json ^
--input_file_type mat-json ^
--output_file c:\path\to\my\document.xml.json ^
--output_file_type mat-json