At any point, you might have a number of questions about the
state of your corpus.
These and other questions can be answered easily by the
experiment engine, MATExperimentEngine.
The power of the experiment engine lies largely in its rich XML configuration. In this
tutorial, we'll learn how to use the experiment engine to answer
one of the questions above; to see how you might answer the
others, consult the rest of the documentation, as
illustrated in the use
cases.
We're going to use the same simple 'Named Entity' task, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
This step is fairly easy, because the XML file to answer the
first question is included as part of the distribution. The XML
file is found in MAT_PKG_HOME/sample/ne/test/exp/exp.xml, and it
looks like this:
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="ne">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="ne_model">
      <training_corpus corpus="ne" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test_run" model="ne_model">
      <test_corpus corpus="ne" partition="test"/>
    </run>
  </runs>
</experiment>
This is one of the simplest complete experiment XML files you can
create. As with all experiment XML files, it describes three types
of entities: corpora, model sets, and runs.
In most cases, your corpora should consist exclusively of
annotated documents which have been marked gold.
So this experiment takes a single set of documents, and
designates 80% of the set for training and the remaining 20% for
test. It then generates a single model from the training
documents, and executes a single run using this model against the
test documents.
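Before handing an experiment file to the engine, it can be handy to sanity-check it. The sketch below is not part of MAT; it simply parses the XML above with Python's standard library and confirms that the partition fractions account for the whole corpus:

```python
import xml.etree.ElementTree as ET

# The experiment XML from above, inlined so the check is self-contained.
EXP_XML = """\
<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="ne">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
</experiment>
"""

root = ET.fromstring(EXP_XML)
fractions = {p.get("name"): float(p.get("fraction"))
             for p in root.iter("partition")}
# The train/test fractions should cover the whole corpus.
assert abs(sum(fractions.values()) - 1.0) < 1e-9
print(fractions)  # → {'train': 0.8, 'test': 0.2}
```

A check like this catches typos (e.g., partitions summing to more or less than 1) before you spend time on a training run.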
This operation is a command-line operation. Try it:
Unix:
% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /tmp/exp \
--pattern_dir $PWD/sample/ne/resources/data/json sample/ne/test/exp/exp.xml
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATExperimentEngine.cmd --exp_dir %TMP%\exp ^
--pattern_dir %CD%\sample\ne\resources\data\json sample\ne\test\exp\exp.xml
The --exp_dir is the directory where the corpora, models and runs
will be computed (and stored, if necessary), and where the results
will be found. The --pattern_dir is the directory in which to look
for the files referred to in the <pattern> elements of the
experiment XML file; the patterns are so-called Unix "glob"
patterns, the standard file patterns familiar to any user
of the Unix shell. The final argument is the
experiment XML file itself.
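If you're unsure what file set a given <pattern> will select, Python's glob module applies the same matching rules, so you can preview the matches. A quick, self-contained illustration (the filenames here are made up for the example):

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few files to glob against.
d = tempfile.mkdtemp()
for name in ("voa1.txt.json", "voa2.txt.json", "notes.txt"):
    open(os.path.join(d, name), "w").close()

# "*.json" matches the two JSON documents, but not notes.txt.
matches = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(d, "*.json")))
print(matches)  # → ['voa1.txt.json', 'voa2.txt.json']
```

Running the equivalent glob against your own --pattern_dir tells you exactly which documents will land in the corpus.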
The engine will create the directory, copy the experiment XML
file into it for archive purposes, and then run the experiment as
described in step 1.
Look in the experiment directory.
Unix:
% ls /tmp/exp
Windows native:
> dir %TMP%\exp
allbytag_excel.csv corpora model_sets
allbytoken_excel.csv exp.xml runs
The corpora, model_sets and runs subdirectories are as specified
in the experiment XML file above (that's what the "dir" attribute
does). What you'll be most interested in are the files
allbytag_excel.csv and allbytoken_excel.csv. These files contain
the tag-level and token-level scoring results (including
Excel-style formulas) for all the runs. The format and
interpretation of these results is found in the documentation for
the scoring output, except that
the initial columns are different; you can find a description of
the differences in the documentation for MATExperimentEngine.
Under /tmp/exp/runs, you'll see a directory for each named run
(in this case, only "test_run"), and below that, a directory for
the name of the model configuration ("ne_model" in this case):
Unix:
% ls /tmp/exp/runs/test_run/ne_model
Windows native:
> dir %TMP%\exp\runs\test_run\ne_model
_done bytoken_excel.csv hyp
bytag_excel.csv details.csv run_input
The important elements here are the individual scoring files
bytag_excel.csv and bytoken_excel.csv, which are (approximately)
the subset of the corresponding overall scoring files
relevant to this run. Of greater interest is details.csv,
the detail spreadsheet for this run. These detail spreadsheets are
not aggregated at the top level because they contain an entry for
each tag, and the volume of data would likely be too great.
For more details about the structure of the experiment output
directory, see here. For
detailed examples for the other questions posed above, see the experiment XML documentation.
We can treat a workspace, or a portion of a workspace, as a
corpus for the experiment engine. If you've done Tutorial 5, and you kept your
workspace around, you can run a simple experiment against that
workspace using MATWorkspaceEngine:
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace run_experiment \
--test_basename_patterns 'voa3,voa4' --test_step tag
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace run_experiment ^
--test_basename_patterns "voa3,voa4" --test_step tag
You can specify an experiment XML file to use, but in this simple
example, we're not using one; we're treating all the gold
documents as the training corpus, except those which match the
--test_basename_patterns (which will be used as the test corpus).
The experiment results will be in the workspace directory in
experiments/<date>.
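Because each invocation writes its results under experiments/<date>, a small script can locate the most recent run. A sketch, assuming the <date> directory names sort chronologically (as date-stamped names do), with a simulated workspace standing in for your real one:

```python
import os
import tempfile

# Simulate a workspace "experiments" directory with dated subdirectories.
ws = tempfile.mkdtemp()
exp_root = os.path.join(ws, "experiments")
for date in ("2011-01-05", "2011-02-17", "2011-02-03"):
    os.makedirs(os.path.join(exp_root, date))

# If the subdirectory names are date stamps, lexicographic order is
# chronological order, so the last entry sorted is the latest run.
latest = sorted(os.listdir(exp_root))[-1]
print(latest)  # → 2011-02-17
```

Point exp_root at <your workspace>/experiments to find the results of your most recent run.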
If you're not planning on doing any other tutorials, remove the experiment directory and the workspace:
Unix:
% rm -rf /tmp/exp
Windows native:
> rd /s /q %TMP%\exp
Unix:
% rm -rf /tmp/ne_workspace
Windows native:
> rd /s /q %TMP%\ne_workspace
If you don't want the "Named Entity" task hanging around, remove
it as shown in the final step of Tutorial
1.