Tutorial 6: The experiment engine

At any point, you might have a number of questions about the state of your corpus.

You may want to know simply how well your training and tagging strategies are doing.
You may want to know how the tagging improves as you increase the number of training documents.
You may want to know how models trained on one corpus of documents perform against a different document corpus.
You may want to know how differently constructed models perform against the same corpus.
You may want to know how different tagging strategies affect the output.

These and other questions can be answered easily by the experiment engine, MATExperimentEngine. The power of the experiment engine lies largely in its rich XML configuration. In this tutorial, we'll learn how to use the experiment engine to answer one of the questions above, and you can examine the other documentation to see how you might answer other questions, as illustrated in the use cases.

We're going to use the same simple 'Named Entity' task, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.

Step 1: Review your XML file for question 1

This step is fairly easy, because the XML file to answer the first question is included as part of the distribution. The XML file is found in MAT_PKG_HOME/sample/ne/test/exp/exp.xml, and it looks like this:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="ne">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="ne_model">
      <training_corpus corpus="ne" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test_run" model="ne_model">
      <test_corpus corpus="ne" partition="test"/>
    </run>
  </runs>
</experiment>

This is one of the simplest complete experiment XML files you can create. As with all experiment XML files, it describes three types of entities.

Corpora are collections of files. Corpora are named, and they are described by the <corpus> element. These elements appear as children of <corpora> elements which allow you to group together processing instructions for multiple corpora. Corpora can be partitioned, as we see here for the "ne" corpus. Each "fraction" is a share of the corpus which is allocated to that partition; the fractions are normalized before the corpus is split, and the overall split is randomly applied. You can have as many partitions in a corpus as you want. The partitions are non-overlapping. We see that these documents should all match the pattern "*.json"; the directory to look for these documents in will be specified at runtime. You can have multiple <corpora> elements.
Model sets are sets of incrementally built models. They, too, are named, and they are described by the <model_set> element. You can specify the increment, in training documents, at which a model is built (e.g., every 100 documents, every 200 documents); if the increment is missing, as it is here, the model set will consist of a single model built from all the training documents in the named corpus. You can group these model sets together under the <model_sets> element in order to share processing information (although no processing information needs to be specified in this example). You can have multiple <model_sets> elements.
Runs create your experiment output data. They, too, are named, and they are described by the <run> element. The <run> elements are grouped together under <runs> elements to share processing information. For each run, the test documents in the named corpora or partitions are stripped of their relevant annotations, and the processing information is used to generate a hypothesis for each stripped document for each model in the named model set. The results are then compared with the gold standard document using the scorer, broken down by run and model.

In most cases, your corpora should consist exclusively of annotated documents which have been marked gold.

So this experiment takes a single set of documents, and designates 80% of the set for training and the remaining 20% for test. It then generates a single model from the training documents, and executes a single run using this model against the test documents.
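The partitioning described above can be sketched in a few lines of Python. This is only an illustration of the idea (normalize the fractions, shuffle, cut into non-overlapping slices); it is not the engine's own code. The function name split_corpus, the document names and the rounding rule are assumptions made for the example.

```python
import random

def split_corpus(files, fractions, seed=None):
    """Split files into non-overlapping partitions by normalized fraction.

    Illustrative only: the rounding rule is an assumption, not necessarily
    what MATExperimentEngine does.
    """
    total = sum(fractions.values())          # normalize, so 4:1 behaves like .8/.2
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)    # the split is applied randomly
    names = list(fractions)
    partitions = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += fractions[name] / total
        # The last partition takes whatever remains, so no file is dropped.
        if i == len(names) - 1:
            end = len(shuffled)
        else:
            end = round(cumulative * len(shuffled))
        partitions[name] = shuffled[start:end]
        start = end
    return partitions

# Ten hypothetical documents, split as in exp.xml.
docs = ["doc%d.json" % i for i in range(1, 11)]
parts = split_corpus(docs, {"train": 0.8, "test": 0.2}, seed=0)
print(len(parts["train"]), len(parts["test"]))  # 8 2
```

Because the fractions are normalized first, passing {"train": 4, "test": 1} produces partitions of the same sizes.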

Step 2: Run the experiment

This is a command-line operation. Try it:

Unix:

% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /tmp/exp \
--pattern_dir $PWD/sample/ne/resources/data/json sample/ne/test/exp/exp.xml

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATExperimentEngine.cmd --exp_dir %TMP%\exp ^
--pattern_dir %CD%\sample\ne\resources\data\json sample\ne\test\exp\exp.xml

The --exp_dir is the directory where the corpora, models and runs will be computed (and stored, if necessary), and where the results will be found. The --pattern_dir is the directory in which to look for the files referred to in the <pattern> elements in the experiment XML file; the patterns are so-called Unix "glob" patterns, which are standard file patterns which should be familiar to any user of the Unix shell. The final argument is the experiment XML file itself.
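If you haven't used glob patterns before, the following Python sketch shows how a pattern like "*.json" selects file names. It uses the standard fnmatch module to illustrate glob semantics; whether the engine uses this module internally is an assumption, and the file names are hypothetical.

```python
import fnmatch

def match_pattern(filenames, pattern):
    """Return the names that match a Unix glob pattern, sorted."""
    # fnmatchcase compares case-sensitively on every platform.
    return sorted(n for n in filenames if fnmatch.fnmatchcase(n, pattern))

# Hypothetical contents of a --pattern_dir directory.
names = ["voa1.txt.json", "voa2.txt.json", "notes.txt", "README"]
print(match_pattern(names, "*.json"))  # ['voa1.txt.json', 'voa2.txt.json']
```

The "*" matches any run of characters, so only names ending in ".json" are selected.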

The engine will create the directory, copy the experiment XML file into it for archive purposes, and then run the experiment as described in step 1.

Step 3: Review the results

Look in the experiment directory.

Unix:

% ls /tmp/exp

Windows native:

> dir %TMP%\exp

allbytag_excel.csv	corpora			model_sets
allbytoken_excel.csv	exp.xml			runs

The corpora, model_sets and runs subdirectories are as specified in the experiment XML file above (that's what the "dir" attribute does). What you'll be most interested in are the files allbytag_excel.csv and allbytoken_excel.csv. These files contain the tag-level and token-level scoring results (including Excel-style formulas) for all the runs. The format and interpretation of these results are described in the documentation for the scoring output, except that the initial columns are different; you can find a description of the differences in the documentation for MATExperimentEngine.
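If you'd rather peek at one of these files from a script than open it in a spreadsheet, a minimal Python sketch follows. It assumes that the first row of the file is a header row and that formula cells begin with "="; neither assumption is confirmed by this tutorial, so consult the scoring documentation for the actual column layout. The function name summarize_scores is hypothetical.

```python
import csv
import os

def summarize_scores(path):
    """Return (header, data_rows, formula_cell_count) for a scoring CSV.

    Assumes the first row is a header and formula cells start with "=".
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return [], [], 0
    header, data = rows[0], rows[1:]
    formulas = sum(1 for row in data for cell in row if cell.startswith("="))
    return header, data, formulas

# The experiment directory used in this tutorial (Unix location).
path = "/tmp/exp/allbytag_excel.csv"
if os.path.exists(path):
    header, data, formulas = summarize_scores(path)
    print(header)
    print(len(data), "rows,", formulas, "formula cells")
```

Formula cells are left as text here; to see computed values, open the file in a spreadsheet application.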

Under /tmp/exp/runs, you'll see a directory for each named run (in this case, only "test_run"), and below that, a directory for the name of the model configuration ("ne_model" in this case):

Unix:

% ls /tmp/exp/runs/test_run/ne_model

Windows native:

> dir %TMP%\exp\runs\test_run\ne_model

_done			bytoken_excel.csv	hyp
bytag_excel.csv		details.csv		run_input

The important elements here are the individual scoring files bytag_excel.csv and bytoken_excel.csv, which are (approximately) the subset of the corresponding overall scoring files which is relevant to this run. Of greater interest is details.csv, which is the detail spreadsheet for this run. These detail spreadsheets are not aggregated at the top level because they contain an entry for each tag, and the volume of data would likely be too great.
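Since the detail spreadsheets are not aggregated, you may want to collect them yourself. The Python sketch below walks the runs/<run>/<model> layout described above and lists each details.csv it finds. The function name find_run_files is hypothetical, and the sketch assumes the runs directory is named "runs", as in this tutorial's experiment XML file.

```python
from pathlib import Path

def find_run_files(exp_dir, filename="details.csv"):
    """Map (run_name, model_name) to the matching file under runs/<run>/<model>/."""
    found = {}
    for path in sorted(Path(exp_dir, "runs").glob("*/*/" + filename)):
        model_dir = path.parent
        found[(model_dir.parent.name, model_dir.name)] = path
    return found

# The experiment directory used in this tutorial (Unix location).
for (run, model), path in find_run_files("/tmp/exp").items():
    print(run, model, path)
```

Pass filename="bytag_excel.csv" or filename="bytoken_excel.csv" to locate the per-run scoring files instead.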

For more details about the structure of the experiment output directory, see here. For detailed examples for the other questions posed above, see the experiment XML documentation.

Step 4: Run an experiment against a workspace

We can treat a workspace, or a portion of a workspace, as a corpus for the experiment engine. If you've done Tutorial 5, and you kept your workspace around, you can run a simple experiment against that workspace using MATWorkspaceEngine:

Unix:

% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace run_experiment \
--test_basename_patterns 'voa3,voa4' --test_step tag

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace run_experiment ^
--test_basename_patterns "voa3,voa4" --test_step tag

You can specify an experiment XML file to use, but in this simple example, we're not using one; we're treating all the gold documents as the training corpus, except those which match the --test_basename_patterns (which will be used as the test corpus).

The experiment results will be in the workspace directory in experiments/<date>.

Step 5: Clean up (optional)

Remove your experiment directories:

Unix:

% rm -rf /tmp/exp

Windows native:

> rd /s /q %TMP%\exp

If you're not planning on doing any other tutorials, remove the workspace:

Unix:

% rm -rf /tmp/ne_workspace

Windows native:

> rd /s /q %TMP%\ne_workspace

If you don't want the "Named Entity" task hanging around, remove it as shown in the final step of Tutorial 1.

This concludes Tutorial 6.