Experiment engine

Description

The experiment engine runs the experiment described in an experiment XML file. The experiment engine runs this experiment in a directory which is provided to it either via the XML file (the "dir" attribute of the <experiment> element) or on the command line (the --exp_dir option). The experiment engine prepares the corpora, builds the models, and performs the experiment runs in this directory. The experiment XML file is copied into the experiment directory, if a file with the same name is not already present.
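Because the directory can come from either source, it can help to check the XML file before deciding whether to pass --exp_dir. The following is a minimal Python sketch, not part of MAT: the function name experiment_dir_from_xml is hypothetical, and the code assumes that <experiment> is either the root element or a descendant of it, and that the file uses no XML namespaces.

```python
import xml.etree.ElementTree as ET


def experiment_dir_from_xml(xml_text):
    """Return the "dir" attribute of the <experiment> element, or None.

    Hypothetical helper, not part of MAT. Assumes <experiment> is the
    root element or appears somewhere beneath the root.
    """
    root = ET.fromstring(xml_text)
    experiment = root if root.tag == "experiment" else root.find(".//experiment")
    if experiment is None:
        return None
    # An empty attribute value is treated the same as a missing one.
    return experiment.get("dir") or None
```

If the function returns None, the directory has to be supplied with --exp_dir.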

By default, the engine can resume an experiment that was halted partway through. Each corpus, model set and run stores its metadata in a file called "properties.txt" in its specified directory, and records whether it has been completed. If the engine fails partway through and is restarted, it will not redo work it knows has been completed. The --force argument overrides this default behavior, and ought to force a full rerun of the experiment; however, the interactions among the components are extremely complex, and --force often fails. If you want to rerun an experiment, the safest approach is to use a different experiment directory.

There is one exception to this generalization. If there are experimental runs present, the engine will always score them, even if it's scored them before. So an easy way to review the scores for an experiment is just to run the engine again.
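To see which components of an experiment directory have recorded metadata, you can look for the "properties.txt" files described above. The sketch below is an illustration only and is not part of MAT: the function name find_properties_files is hypothetical, and it makes no assumption about the layout of subdirectories or about the contents of the files, which are not described here.

```python
import os


def find_properties_files(exp_dir):
    """Return the sorted paths of every properties.txt under exp_dir.

    Hypothetical helper, not part of MAT. It only locates the files;
    it does not try to interpret what is stored in them.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(exp_dir):
        if "properties.txt" in filenames:
            found.append(os.path.join(dirpath, "properties.txt"))
    return sorted(found)
```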

A set of use cases for the experiment engine and the structure of its output directory are described in separate documents.

Usage

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd

Usage: MATExperimentEngine [options] <xml_file>

<xml_file>: An experiment XML file

Options

--language <l>
A language name or code supported by the experiment task. Ignored if the experiment XML file specifies a language. Required if all of the following hold: no language is specified in the experiment XML file, the task supports multiple languages, and some of the actions executed in the experiment vary depending on the language.
--exp_dir <dir>
Optionally, the directory the experiment will be run in. This directory may also be provided in the experiment XML file (if both are provided, the command-line setting is ignored). The directory will be created if it doesn't yet exist.
--pattern_dir <dir>
Optionally, the prefix applied to relative paths in the file patterns of the <pattern> elements in the corpora in the experiment XML file. If this option is omitted, these patterns must be absolute pathnames.
--binding <k>=<v>
Optionally, add a binding to be used in expanding settings in the experiment file. These values override values in the experiment file itself.
--csv_formula_output <s>
A comma-separated list of CSV output formats. The possible values are 'oo' (formulas with OpenOffice separators), 'excel' (formulas with Excel separators) and 'literal' (no formulas). The experiment engine produces CSV output files for each of the formats you specify. The default is 'excel'. Note that the OpenOffice and Excel formula formats are incompatible with each other, so output files with Excel separators can only be opened in Excel, and output files with OpenOffice separators only in OpenOffice.
--dont_compute_confidence
By default, the experiment engine computes confidence measures when it runs the scorer. This process can be time consuming. Disable it with this flag.
--dont_rescore
By default, the experiment engine rescores complete runs when it's restarted. Use this flag to disable this feature. This should only be used for debugging purposes, because the scores from the completed runs won't be accumulated in this mode.

MATExperimentEngine also makes the common options available.
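If you drive the engine from a script, the options above can be assembled into an argument list. The following Python sketch is one way to do that; it is not part of MAT, and the function name build_engine_command and its parameter names are hypothetical. It uses only the flags listed above, expects csv_formula_output as a list of strings, and assumes that --binding may be repeated once per key/value pair, which this page does not state explicitly.

```python
def build_engine_command(executable, xml_file, exp_dir=None, pattern_dir=None,
                         language=None, bindings=None, csv_formula_output=None,
                         compute_confidence=True, rescore=True):
    """Assemble an argument list for MATExperimentEngine.

    Hypothetical helper, not part of MAT. csv_formula_output is a list
    drawn from 'oo', 'excel' and 'literal'; bindings is a dict.
    """
    allowed_csv = {"oo", "excel", "literal"}
    cmd = [executable]
    if language:
        cmd += ["--language", language]
    if exp_dir:
        cmd += ["--exp_dir", exp_dir]
    if pattern_dir:
        cmd += ["--pattern_dir", pattern_dir]
    # Assumption: --binding can be given once per key/value pair.
    for key, value in (bindings or {}).items():
        cmd += ["--binding", "%s=%s" % (key, value)]
    if csv_formula_output:
        unknown = set(csv_formula_output) - allowed_csv
        if unknown:
            raise ValueError("unsupported CSV formats: %s" % sorted(unknown))
        # The option takes a single comma-separated value.
        cmd += ["--csv_formula_output", ",".join(csv_formula_output)]
    if not compute_confidence:
        cmd.append("--dont_compute_confidence")
    if not rescore:
        cmd.append("--dont_rescore")
    # The experiment XML file is the final positional argument.
    cmd.append(xml_file)
    return cmd
```

The resulting list can be passed to subprocess.run from the Python standard library. Pass the full path to MATExperimentEngine (or MATExperimentEngine.cmd on Windows) as the executable argument.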

Advanced options

These options are more complicated, and not as well supported. Use them at your own risk.

--force
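If present, forces the engine to reprocess the entire experiment, including work already marked as completed. As noted in the description, this often fails; using a different experiment directory is safer.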
--batch_test_runs
By default, test runs are performed as soon as the relevant model is available. This flag postpones all test runs until after all models are constructed.
--mark_done
This flag is intended for the exceptional situation where you've interrupted an experiment before it's completed, and you just want to rerun the scoring for what's already done. This flag will force the engine to mark all corpora, models and runs as completed. The effect is that from this point on, the engine will only report scores for this experiment.

Examples

Examples of the experiment XML files themselves are documented separately.

Example 1

Let's say your experiment XML file /documents/exp_files/exp.xml contains a value for the "dir" attribute of the <experiment> element, and all the paths in the <pattern> elements are absolute. Then your invocation is simple:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd c:\documents\exp_files\exp.xml

Example 2

Let's say that your experiment XML file does not contain a value for the "dir" attribute, and you want to create an experiment run in /documents/exp_runs/run1:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
/documents/exp_files/exp.xml


Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
c:\documents\exp_files\exp.xml

Example 3

Let's say you have the same situation as in example 2, but you don't want spreadsheet formulas in your output, because you're feeding the data to a statistical package like R instead of to Excel:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--csv_formula_output literal /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--csv_formula_output literal c:\documents\exp_files\exp.xml

Example 4

Let's say that you have the same situation as in example 2, and you want to view the results in a spreadsheet, but you can't afford Excel, and you're using OpenOffice instead:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--csv_formula_output oo /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--csv_formula_output oo c:\documents\exp_files\exp.xml

Example 5

Let's say you're in the same situation as in example 2, but you have relative pathnames in <pattern> elements in your XML file, and all the document paths are relative to /documents/completed:

Unix:

% $MAT_PKG_HOME/bin/MATExperimentEngine --exp_dir /documents/exp_runs/run1 \
--pattern_dir /documents/completed /documents/exp_files/exp.xml

Windows native:

> %MAT_PKG_HOME%\bin\MATExperimentEngine.cmd --exp_dir c:\documents\exp_runs\run1 ^
--pattern_dir c:\documents\completed c:\documents\exp_files\exp.xml
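
The effect of --pattern_dir on a <pattern> value can be pictured with a short Python sketch. This illustrates the documented behavior only and is not the engine's actual implementation: the function name resolve_pattern is hypothetical, and treating the prefix as a simple path join is an assumption.

```python
import os


def resolve_pattern(pattern, pattern_dir=None):
    """Return the path pattern the engine would be expected to use.

    Hypothetical illustration, not MAT code. Absolute patterns are
    returned unchanged; relative ones are prefixed with pattern_dir.
    """
    if os.path.isabs(pattern):
        return pattern
    if pattern_dir is None:
        # Without --pattern_dir, patterns must be absolute pathnames.
        raise ValueError("relative pattern %r requires --pattern_dir" % pattern)
    return os.path.join(pattern_dir, pattern)
```

For instance, with a pattern directory of /documents/completed, a hypothetical relative pattern such as batch1/*.json would be resolved beneath /documents/completed, while an absolute pattern would be left alone.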