Use cases for the XML format for the experiment files (see MATExperimentEngine) are
described here. The full element-by-element documentation is in the experiment XML reference.
In most of the examples below, we're going to use the sample "Named Entity" task.
The simplest possible experiment involves a single corpus, a single model, and a single run. Assume you have a set of "gold" documents in /documents/newswire/*.json.
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="ne">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="ne_model">
<training_corpus corpus="ne" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test_run" model="ne_model">
<test_corpus corpus="ne" partition="test"/>
</run>
</runs>
</experiment>
This experiment takes a single set of documents, and designates
80% of the set for training and the remaining 20% for test. It
then generates a single model from the training documents, and
executes a single run using this model against the test documents.
If all your documents have the ".json" extension, and you want to
reuse this experiment XML file, just change the <pattern>
element entry to a relative pathname and use the --pattern_dir
argument when you call MATExperimentEngine.
...
<corpus name="ne">
<pattern>*.json</pattern>
</corpus>
...
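For example, assuming the experiment file above is saved as /experiments/xml/ne.xml and the documents live in /documents/newswire (both paths are hypothetical), the invocation might look like this:
% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /experiments/ne --pattern_dir /documents/newswire /experiments/xml/ne.xml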
Let's say you've set aside a test corpus which you want to hold
constant across a set of experiments, in
/documents/newswire-test/*.json. You can use an experiment XML
file such as this one:
<experiment task='Named Entity'>
<corpora dir="corpora">
<corpus name="train_nw">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
<corpus name="test_nw">
<pattern>/documents/newswire-test/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="train">
<training_corpus corpus="train_nw"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="train">
<test_corpus corpus="test_nw"/>
</run>
</runs>
</experiment>
Here, we have two separate corpora, which are not split; one is
used as a training corpus, and the other as a testing corpus. We
generate one model, and one run.
Sometimes, you want to run the model against the corpus that
produced it. In the example in the previous use case, you can
modify the <runs> as follows:
...
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="train">
<test_corpus corpus="train_nw"/>
</run>
</runs>
...
You can run experiments against workspaces. There are special
workspace capabilities devoted to
experiments, and special experiment engine capabilities
devoted to workspaces.
Let's say you have a workspace, and you've introduced two
basename sets: test_1 and test_2. Furthermore, there are documents
in the workspace which are in neither set. Let's also suppose that
you want to restrict the documents to those which are gold or
reconciled for the (single) trainable step in the workspace.
Finally, let's suppose you have multiple workspaces with these
basename set names, and you want to use bindings to specify the
specific workspace to use. Here's how you use test_1 and test_2
together as your test set, and use the rest of the documents in
the workspace as your training set:
<experiment task='Named Entity'>
<workspace_corpora dir="corpora" workspace_dir="$(WS)"
step_statuses="gold,reconciled">
<workspace_corpus name="test" basename_sets="test_1,test_2"/>
<workspace_corpus name="train" use_remainder="yes"/>
</workspace_corpora>
<model_sets dir="model_sets">
<build_settings partial_training_on_gold_only="yes"/>
<model_set name="train">
<training_corpus corpus="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
<score_args gold_only="yes"/>
</run_settings>
<run name="test" model="train">
<test_corpus corpus="test"/>
</run>
</runs>
</experiment>
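The $(WS) reference above is a binding that is filled in when the engine is invoked. Here is a minimal sketch of such an invocation, assuming the engine's binding option is spelled --binding and takes KEY=VALUE pairs (check the MATExperimentEngine documentation for the exact spelling), and using hypothetical paths:
% bin/MATExperimentEngine --exp_dir /experiments/ws_exp --binding "WS=/workspaces/ne_workspace" /experiments/xml/ws_exp.xml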
Note, too, that we've enhanced the experiment file as recommended in the
documentation on using workspaces with the experiment engine.
This example uses the "Sample Relations" task, whose main "Mixed Initiative
Annotation" workflow has three trainable steps: entity_tag (for PERSON,
LOCATION, ORGANIZATION), nationality_tag (for NATIONALITY), and
relation_tag (for Employment, Located). Because the task contains multiple trainable steps, you must specify the trainable step in question when you declare your model set.
In our first example, we'll train a single step. We recommend that, rather than tagging your documents from
scratch during your run, you use the <prep_args> element to
undo your tags through the step you're testing. Only the trainable steps for which models have been built in the experiment will be scored.
<experiment task='Sample Relations'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="ne">
<pattern>/documents/newswire_relations/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<build_settings trainable_step="nationality_tag"/>
<model_set name="ne_model">
<training_corpus corpus="ne" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<prep_args output_file_type="mat-json" undo_through="nationality_tag" workflow="Mixed Initiative Annotation"/>
<args steps="nationality_tag" workflow="Mixed Initiative Annotation"/>
</run_settings>
<run name="test_run" model="ne_model">
<test_corpus corpus="ne" partition="test"/>
</run>
</runs>
</experiment>
In these multiple-step workflows and tasks, we can train and test against multiple trainable steps at once. Here's how you'd do that for the "nationality_tag" and "entity_tag" steps:
<experiment task='Sample Relations'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="ne">
<pattern>/documents/newswire_relations/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<build_settings trainable_step="nationality_tag"/>
<model_set name="ne_nat_model">
<training_corpus corpus="ne" partition="train"/>
</model_set>
</model_sets>
<model_sets dir="model_sets">
<build_settings trainable_step="entity_tag"/>
<model_set name="ne_ent_model">
<training_corpus corpus="ne" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<prep_args output_file_type="mat-json" undo_through="entity_tag" workflow="Mixed Initiative Annotation"/>
<args steps="entity_tag,nationality_tag" workflow="Mixed Initiative Annotation"/>
</run_settings>
<run name="test_run" model="ne_nat_model,ne_ent_model">
<test_corpus corpus="ne" partition="test"/>
</run>
</runs>
</experiment>
Note that the model attribute of the <run> element accepts a comma-separated sequence of model names, in spite of the attribute name being singular. When there are multiple models for a run, they must all be trained against the same corpus.
You can use the experiment engine to do crossvalidation.
In
crossvalidation, a single corpus is split into N partitions, where N is
the number of crossvalidation folds, and N models are built; for each
model I, the training data consists of the corpus with partition I
removed. The models are tested against the same corpus; each model I is
tested against partition I.
In this case, the run must use the same corpus as the model(s), so no test corpus may be specified for the run. Your experiment file might look like this:
<experiment task='Named Entity'>
<corpora dir="corpora">
<crossvalidation folds="6"/>
<corpus name="train_nw">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="train">
<training_corpus corpus="train_nw"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="train"/>
</runs>
</experiment>
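With folds="6" and, say, a 150-document corpus (the document count here is purely illustrative), each partition would hold 25 documents; model 1 would be trained on the 125 documents outside partition 1 and tested against the 25 documents in partition 1, and so on for each of the 6 folds.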
There are some restrictions on crossvalidation corpora. Because the <crossvalidation> element creates its own partitions, you may not declare any partitions for the corpus when you define it. Furthermore, if a model is trained with a crossvalidation corpus, that corpus must be its only training corpus (under normal circumstances, you may use multiple corpora when training a model), its partitions may not be referred to, and you may not specify the corpus size in the corpus settings or use the corpus size iterator.
Let's say you have two corpora, and you want to split each of
them 4-to-1, and use the larger slice of each of them, together,
to build a single model, and test against the smaller slice of
each of them, in a single run:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="nw1">
<pattern>/documents/newswire-1/*.json</pattern>
</corpus>
<corpus name="nw2">
<pattern>/documents/newswire-2/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="train">
<training_corpus corpus="nw1" partition="train/>
<training_corpus corpus="nw2" partition="train/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="train">
<test_corpus corpus="nw1" partition="test"/>
<test_corpus corpus="nw2" partition="test"/>
</run>
</runs>
</experiment>
To find out what happens when you use more and more training
data, we add a <corpus_settings> element to
<model_sets>, as follows:
...
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="50"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test"/>
</model_set>
</model_sets>
...
In this case, we're telling the experiment engine to build a
model at 50-document increments. So if the corpus contains 150
documents, the experiment engine will build three models, and
produce one set of three runs.
If your corpus has more than 100 documents, but fewer than 150,
the above <corpus_settings> will only build two
models. If you want a model built for the remainder of the corpus, use the
"force_last" attribute:
...
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="50" force_last="yes"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test"/>
</model_set>
</model_sets>
...
The jCarafe tagger can bias its decoding toward precision or toward recall during automated tagging, using the --<step_name>_prior_adjust flag. If you want to compare two decoding strategies, one biased heavily toward recall and the other toward precision, you might do this:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="newswire">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="newswire">
<training_corpus corpus="newswire" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo" carafe_tag_prior_adjust="-3.0"/>
</run_settings>
<run name="recall_bais" model="newswire">
<test_corpus corpus="newswire" partition="test"/>
</run>
</runs>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo" carafe_tag_prior_adjust="3.0"/>
</run_settings>
<run name="precision_bias" model="newswire">
<test_corpus corpus="newswire" partition="test"/>
</run>
</runs>
</experiment>
In this case, we have two different <runs> elements, because the run settings differ for the two runs. So we end up with one corpus, one model, and two runs.
Let's say that in addition to increasing the size of the training
corpus, you also want to know what happens when you increase the number
of training iterations during model building, and you also want to
vary the recall/precision bias of the decoder. You can do all
these things at once, as follows:
...
<model_sets dir="model_sets">
<build_settings>
<iterator type="increment" attribute="max_iterations"
start_val="3" end_val="9" increment="2"/>
</build_settings>
<corpus_settings>
<iterator type="corpus_size" increment="50"/>
</corpus_settings>
<model_set name="train">
<training_corpus corpus="nw1" partition="train/>
<training_corpus corpus="nw2" partition="train/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
<iterator type="increment" attribute="carafe_tag_prior_adjust"
start_val="-3.0" end_val="3.0" increment="1.0"
force_last="yes"/>
</run_settings>
<run name="test" model="train">
<test_corpus corpus="nw1" partition="test"/>
<test_corpus corpus="nw2" partition="test"/>
</run>
</runs>
...
If your training corpus is 150 documents, this experiment will
generate 12 models: for each of the corpus sizes 50, 100, and 150, a
model for each of the max_iterations values (3, 5, 7, 9). For each
model, the experiment will conduct 7 runs, one for each of the
values of carafe_tag_prior_adjust. Note that we've used force_last
to force the final value to be used, even if floating-point rounding
means it's not exactly 3.0.
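In total, then, this configuration yields 12 x 7 = 84 scored runs.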
Let's say you have two sets of completed documents: a set of
newswire documents, in /documents/newswire/*.json, and a set of
chat transcripts, in /documents/chat/*.json. Both these document
sets are tagged with the same tag set. If you want to know how a
model built against each will work on the other, here's an
experiment XML file that accomplishes that:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="newswire">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
<corpus name="chat">
<pattern>/documents/chat/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="newswire">
<training_corpus corpus="newswire" partition="train"/>
</model_set>
<model_set name="chat">
<training_corpus corpus="chat" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="nw_train_nw_test" model="newswire">
<test_corpus corpus="newswire" partition="test"/>
</run>
<run name="nw_train_chat_test" model="newswire">
<test_corpus corpus="chat" partition="test"/>
</run>
<run name="chat_train_chat_test" model="chat">
<test_corpus corpus="chat" partition="test"/>
</run>
<run name="chat_train_nw_test" model="chat">
<test_corpus corpus="newswire" partition="test"/>
</run>
</runs>
</experiment>
This experiment XML file will split each corpus 80%/20%, and
build two models, one from each corpus. Finally, it performs a
four-way comparison between the models and the test subsets of the
corpora.
Let's say that you have a jCarafe lexicon directory, as described
in the documentation for MATModelBuilder.
You want to know whether using this lexicon results in a better
model. Here's an experiment XML file which accomplishes that:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="newswire">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="newswire">
<training_corpus corpus="newswire" partition="train"/>
</model_set>
</model_sets>
<model_sets dir="model_sets">
<build_settings lexicon_dir="/documents/newswire_lexicon/"/>
<model_set name="newswire_w_lex">
<training_corpus corpus="newswire" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="w_lex" model="newswire_w_lex">
<test_corpus corpus="newswire" partition="test"/>
</run>
<run name="wo_lex" model="newswire">
<test_corpus corpus="newswire" partition="test"/>
</run>
</runs>
</experiment>
In this case, there are two different <model_sets>
elements, because the build settings for the enclosed models
differ. We have one corpus, two models, and two runs.
You can further specify any of the advanced settings for the
trainer, if you know what you're doing. See MATModelBuilder for whatever
documentation is available.
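As a hedged sketch (we're assuming here that advanced trainer options such as max_iterations can be passed as attributes of <build_settings>, in the same way lexicon_dir is above; confirm the available settings against the MATModelBuilder documentation), you might fix the number of training iterations like this:
...
<model_sets dir="model_sets">
<build_settings max_iterations="7"/>
<model_set name="newswire">
<training_corpus corpus="newswire" partition="train"/>
</model_set>
</model_sets>
...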
There may be situations where you have multiple approaches to the
same annotation task, one model-based and the other
regular-expression-based (e.g., the regular expression strategy is
a baseline against which you're comparing your model execution).
In the experiment engine, the model attribute of the <run>
element is optional, to allow you to do this. If you omit the
model attribute, you must provide the score_steps attribute in the
<score_args> element to ensure that the engine knows what
step is being evaluated.
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="newswire">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="newswire">
<training_corpus corpus="newswire" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="newswire" model="newswire">
<test_corpus corpus="newswire" partition="test"/>
</run>
</runs>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,re_tag" workflow="Baseline"/>
<score_args score_steps="re_tag"/>
</run_settings>
<run name="newswire_baseline">
<test_corpus corpus="newswire" partition="test"/>
</run>
</runs>
</experiment>
As long as the re_tag step and the tag step add the same
annotation sets, the experiment will create comparable scores
between the two runs.
Sometimes, you may need to do some preprocessing of a corpus.
Let's assume, for instance, that your source documents are inline XML
whose tags overlay the signal, and that your task provides a
"Preprocess" workflow whose zone, tokenize, and align steps prepare
such documents. To do this preprocessing during the experiment, you'd
use the <prep> element:
...
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<prep steps="zone,tokenize,align" workflow="Preprocess" input_file_type="xml-inline" xml_input_is_overlay="yes"/>
<corpus name="test">
<pattern>/documents/newswire/*.xml</pattern>
</corpus>
</corpora>
...
Let's say, for instance, that you're working with the MUC
(Message Understanding Conference) corpus, and you're not tagging
the header portions of the documents. Under normal circumstances,
when it prepares an experiment run, the experiment engine converts
the test documents to raw text, and processes them starting from
raw text. However, in this case, you can't actually recreate the
zoning with your own zoner; you need the zoning as it was provided
in the MUC documents. In this situation, you can use the
<prep_args> element in the <run> element to specify a
set of parameters to MATEngine to modify the default test document
preparation:
...
<runs dir="runs">
<run_settings>
<prep_args output_file_type="mat-json" undo_through="tag" workflow="Demo"/>
<args steps="tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
...
Here, instead of undoing all steps by using an output_file_type
of "raw" (which is the default), we undo the "tag" step and use
MAT JSON documents as the inputs to the run; we see that the
<args> for the run only does the "tag" step.
Sometimes, you might want to prepare a corpus ahead of time, with
a fixed partition, a fixed prep phase, or the like. You can use
the experiment engine to create a corpus alone, and then refer to
that corpus elsewhere.
For instance, you might prepare the corpus in the previous use
case with nothing in the <experiment> element except the
<corpora>:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<prep steps="zone,tokenize,align" workflow="Preprocess" input_file_type="mat-json"/>
<corpus name="test">
<pattern>/documents/newswire/*.json</pattern>
</corpus>
</corpora>
</experiment>
Assume we save this XML file to /experiments/xml/corpus.xml, and
output the experiment into /experiments/corpus1:
% cd $MAT_PKG_HOME
% bin/MATExperimentEngine --exp_dir /experiments/corpus1 /experiments/xml/corpus.xml
The corpus will be in the "corpora" subdirectory, in the
subdirectory named "test" (the name of the corpus).
Now, let's refer to it in a different experiment XML file:
<experiment task='Named Entity'>
<corpora dir="corpora">
<corpus name="local_test" source_corpus_dir="/experiments/corpus1/corpora/test"/>
</corpora>
...
</experiment>
Instead of including a <pattern> element, we use the
"source_corpus_dir" attribute. The corpus referred to can itself
have a "source_corpus_dir" attribute (i.e., you can chain them).
Local <prep> or <partition> elements can augment or
override remote elements; the combinations are complex, and you
can find more documentation on them in the experiment XML reference.