This document describes the XML format for experiment files (see MATExperimentEngine).
Use cases are described separately.
- <experiment>
  - <binding>
  - <corpora>
    - <partition>
    - <fixed_partition>
    - <crossvalidation>
    - <size>
    - <corpus>
      - <pattern>
    - <prep>
  - <workspace_corpora>
    - <workspace_corpus>
      - <partition>
      - <size>
      - <fixed_partition>
      - <crossvalidation>
  - <model_sets>
    - <build_settings>
      - <iterator>
    - <corpus_settings>
      - <iterator>
    - <model_set>
      - <training_corpus>
  - <runs>
    - <run_settings>
      - <prep_args>
      - <score_args>
      - <args>
      - <iterator>
    - <run>
      - <test_corpus>
The <experiment> element is the toplevel element in the file. Note
that none of its five child elements are obligatory; the experiment
XML can be used simply to build corpora, or to build models, without
performing any experimental runs, if, for instance, you want to build
a model or corpus to be used in multiple experiments.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | The directory in which the experiment will be conducted. If the directory does not exist, it will be created. If not specified, the directory must be provided when the experiment is run. |
task | a string | yes | The name of a task, as would be passed to the --task argument of MATEngine. This setting is used to establish the task for the corpus preparation and for the experiment runs, and also to establish the set of available tags for the training. |
language | a language name or code | no | A language name or code as defined in the task. Obligatory if the task supports multiple languages and any of the steps executed in the experiment, or any of the model builders, vary according to the language. The language can also be provided by the MATExperimentEngine. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<binding> | no | yes | Bindings to be made globally available in the other elements. |
<corpora> | no | yes | The corpora to be used in the experiment. |
<workspace_corpora> | no | yes | The corpora to be used in the experiment that will be drawn from workspaces. |
<model_sets> | no | yes | The model sets to be used in the experiment. |
<runs> | no | yes | The experimental runs to be used in the experiment. |
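
As a rough sketch of how these children fit together, the skeleton below shows an experiment file with one child of each kind. The task name, directories, and binding are hypothetical placeholders, and the contents of each child are reduced to comments.

```xml
<!-- "My Task" and all paths are placeholders, not values defined by MAT. -->
<experiment task="My Task" dir="/tmp/my_experiment">
  <binding name="DATA" value="/data/my_corpus"/>
  <corpora>
    <!-- <corpus> elements, plus optional partition, size and prep settings -->
  </corpora>
  <workspace_corpora workspace_dir="/data/my_workspace">
    <!-- <workspace_corpus> elements -->
  </workspace_corpora>
  <model_sets>
    <!-- optional <build_settings>, then <model_set> elements -->
  </model_sets>
  <runs>
    <!-- <run_settings>, then <run> elements -->
  </runs>
</experiment>
```

Since none of the children are obligatory, a file that only builds corpora would keep just the <corpora> element.
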
The <binding> element allows the user to define global bindings which
can be referred to in any other element of the experiment XML file
(except the attributes of the <experiment> element itself, and the
<binding> elements). These bindings can be referred to either in XML
attributes or in text within XML elements. A binding is referenced
with the pattern $(<name>). The experiment directory, whether provided
via the dir attribute of the <experiment> element or on the command
line, is provided as the binding EXP_DIR; the pattern directory, if
provided by the --pattern_dir command line argument to
MATExperimentEngine, is provided as the binding PATTERN_DIR.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The binding to be replaced. The engine will look for $(<name>) anywhere in the attribute values or text in the experiment XML file. |
value | a string | yes | The value to replace $(<name>) with. This replacement is not recursive; that is, you should not include any $(<name>) substrings in your value, unless you want them to be included literally, because they will not be expanded. |
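
A minimal sketch of a binding in use follows. The binding name DATA, its value, the task name, and the corpus name are all hypothetical; only EXP_DIR is provided by the engine.

```xml
<experiment task="My Task">
  <!-- Hypothetical binding; the value is used literally and is not re-expanded. -->
  <binding name="DATA" value="/data/news"/>
  <!-- $(EXP_DIR) is supplied by the engine; $(DATA) comes from the binding above. -->
  <corpora dir="$(EXP_DIR)/corpora">
    <corpus name="news">
      <!-- Bindings are replaced in element text as well as in attribute values. -->
      <pattern>$(DATA)/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```
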
The <corpora> element describes corpora to be used in the experiment.
This element may be repeated; the intention is that a single <corpora>
element will correspond to a shared set of preprocessing instructions.

The corpora may be local, in which case a set of patterns should be
provided, or remote, in which case the source_corpus_dir attribute
should be provided. Remote corpora are used directly unless one or
more of the processing tags are specified (<partition>,
<fixed_partition>, <prep>, <crossvalidation>). In this case, the
specified processing steps are added or redone locally, on a separate
copy of the corpus. For instance, if the remote corpus is split into
test and train, but not preprocessed, and the <prep> tag is specified
here, the corpus documents will be preprocessed here, and the remote
split will be preserved. If the remote corpus is preprocessed and
split, but the local <partition> tag specifies that the corpus type is
"train", the remote corpus preprocessing will be preserved, but
locally the split will be ignored. If the remote corpus contains
enough patterns for 300 documents, but max_size remotely is 100 and
max_size locally is 200, the local max_size will be used; this is
possible because all the documents are preprocessed by default when a
corpus is prepared, regardless of max_size, and the order of documents
(after an initial randomization) is preserved from remote corpus to
local copy.

Note that inside the experiment engine, MAT uses the MAT JSON document
format exclusively. Therefore, if you want to provide documents which
are in a different format which MAT also understands (e.g., XML
inline), you must use the <prep> tag to convert the documents to MAT
JSON format.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | The directory where the corpora are found, or should be built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "corpora" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<partition> | no | yes | The proportional partition settings for this group of corpora. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <corpora> element. |
<fixed_partition> | no | yes | The fixed partition settings for this group of corpora. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <corpora> element. |
<crossvalidation> | no | no | The crossvalidation settings for this group of corpora. If this element is present, the corpora will be prepared for crossvalidation. You cannot mix fixed and proportional partitions and crossvalidation within a given <corpora> element. |
<size> | no | no | The size settings for this group of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established). |
<corpus> | yes | yes | The individual corpora in this group. |
<prep> | no | no | The arguments to the MATEngine command to use to preprocess the corpora. For instance, this command might take documents which have been deidentified and resynthesize fillers for the deidentified regions. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
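
The following sketch shows two corpora sharing one set of preprocessing instructions. The corpus names, workflow name, step names, and paths are hypothetical; only the element and attribute names are taken from the tables above.

```xml
<corpora dir="$(EXP_DIR)/corpora">
  <!-- Placeholder workflow and steps; substitute those defined by your task. -->
  <prep input_file_type="mat-json" workflow="Demo" steps="zone,tokenize"/>
  <corpus name="news">
    <pattern>/data/news/*.json</pattern>
  </corpus>
  <corpus name="blogs">
    <pattern>/data/blogs/*.json</pattern>
  </corpus>
</corpora>
```

Both corpora are prepared with the same <prep> settings; corpora that need different preprocessing would go in a separate <corpora> element.
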
The <partition> element specifies a proportional partition of the
sister corpora specified with the <corpus> tag. It may be repeated. If
there are no instances of this element or <fixed_partition>, the
corpus has no partitions. The proportional partitions segment the
entire corpus, so the fraction values are normalized to shares of the
corpus. If you want just a tenth of the corpus, for instance, you must
divide the corpus into two partitions at a ratio of 9:1 and ignore the
larger slice.

Each <corpora> element may have either proportional or fixed partitions, or be designated as a crossvalidation corpus, but not more than one of these.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the partition. |
fraction | a float | yes | The share of each corpus that should be allotted to this partition (a float between 0 and 1). |
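
The "tenth of the corpus" case described above might be written as follows. The partition and corpus names and the path are hypothetical.

```xml
<corpora>
  <!-- The two fractions divide the whole corpus at a ratio of 9:1. -->
  <partition name="unused" fraction=".9"/>
  <partition name="sample" fraction=".1"/>
  <corpus name="news">
    <pattern>/data/news/*.json</pattern>
  </corpus>
</corpora>
```

Later elements would then refer only to the "sample" partition and ignore the larger slice.
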
The <fixed_partition> element specifies a fixed partition of the
sister corpora specified with the <corpus> tag. It may be repeated. If
there are no instances of this element or <partition>, the corpus has
no partitions. These partitions do not segment the entire corpus. If
you want a fixed partition to encompass "everything else", use the
special "remainder" value as described below.

Each <corpora> element may have either proportional or fixed partitions, or be designated as a crossvalidation corpus, but not more than one of these.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the partition. |
size | an integer or the string "remainder" | yes | The portion of each corpus that should be allotted to this partition (either an integer number of documents or the string "remainder"). The "remainder" value can only be used once in a <corpora> element. |
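
A sketch of fixed partitions, with one partition taking "everything else", is shown below. The names, the size of 50, and the path are illustrative only.

```xml
<corpora>
  <!-- 50 documents go to "test"; all remaining documents go to "train". -->
  <fixed_partition name="test" size="50"/>
  <fixed_partition name="train" size="remainder"/>
  <corpus name="news">
    <pattern>/data/news/*.json</pattern>
  </corpus>
</corpora>
```
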
The <crossvalidation> element indicates that the corpus is a
crossvalidation corpus, and allows the specification of the number of
crossvalidation folds.

A crossvalidation corpus is used by the experiment engine for
crossvalidation. Internally, it is split into <n> partitions, where
<n> is the number of folds. A model which is trained against a
crossvalidation corpus actually generates <n> models, where the
training data for each model <i> is the crossvalidation corpus with
partition <i> removed. When the model is used in a run, the run must
also be applied to the crossvalidation corpus, and each model <i> is
applied to partition <i>. The aggregated results are treated as a
single run.

Each <corpora> element may have either proportional or fixed
partitions, or be designated as a crossvalidation corpus, but not more
than one of these.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
folds | an integer | yes | The number of crossvalidation folds. |
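
A crossvalidation corpus might be declared as in the sketch below; the fold count, corpus name, and path are hypothetical.

```xml
<corpora>
  <!-- With folds="10", the corpus is split into 10 partitions internally. -->
  <crossvalidation folds="10"/>
  <corpus name="news_cv">
    <pattern>/data/news/*.json</pattern>
  </corpus>
</corpora>
```
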
The <size> element specifies the size properties of the sister corpora
specified with the <corpus> tag. If this element is present, and a
sister corpus has the source_corpus_dir attribute set, the specified
values will override those in the source corpus.
This element is not compatible with <crossvalidation>.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
max_size | an integer | no | The maximum number of documents in each corpus. If specified, each corpus will not exceed this number. This limit is applied last, so the corpus can be reused with a greater max_size specified if the requisite number of documents are available. |
truncate_document_list | "yes" | no | If present and max_size is also present, the max_size limit will be imposed first, rather than last. The consequence of this is that no more than max_size documents will be available to remote accesses of this corpus. |
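
As an illustration, the sketch below caps each corpus at 100 documents and applies the limit first. The number, corpus name, and path are hypothetical.

```xml
<corpora>
  <!-- Omit truncate_document_list to have the limit applied last,
       which leaves the full document list available for reuse. -->
  <size max_size="100" truncate_document_list="yes"/>
  <corpus name="news">
    <pattern>/data/news/*.json</pattern>
  </corpus>
</corpora>
```
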
A <corpus> element specifies a corpus either by a set of patterns, or
by a reference to another corpus (via source_corpus_dir). The
documents specified by a set of patterns are randomly reordered before
any subsequent processing is performed (e.g., split, preprocess). If
source_corpus_dir is present, patterns are ignored.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built, if the corpus is either local (i.e., it has patterns), or remote with processing overrides. |
source_corpus_dir | a string | no | If present, a pathname of an existing corpus directory. If the path is not an absolute path, the experiment directory will be prepended. The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain). Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name]. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<pattern> | yes | yes | A glob-style pattern of files to use to construct this corpus. "Glob" style is the UNIX shell file pattern matching; e.g., "*" matches everything. (This is in contrast to standard regular expressions.) If this path pattern isn't an absolute path, the --pattern_dir option of MATExperimentEngine must be used to provide the location of the patterns. This element has no attributes or element children; its value is the text it delimits. |
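
The sketch below illustrates one way a corpus might refer to an earlier corpus in the same experiment file via source_corpus_dir. All names and paths are hypothetical, and it assumes that a corpus with source_corpus_dir needs no <pattern> children, since they would be ignored.

```xml
<experiment task="My Task">
  <corpora>
    <corpus name="base">
      <pattern>/data/news/*.json</pattern>
    </corpus>
  </corpora>
  <corpora>
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <!-- The first <corpora> has no "dir" attribute, so "base" is built under
         corpora/base in the experiment directory; the relative path here is
         resolved against the experiment directory. -->
    <corpus name="base_split" source_corpus_dir="corpora/base"/>
  </corpora>
</experiment>
```

Because corpora are created in file order, "base" exists by the time "base_split" is built.
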
The <prep> element houses the arguments to the MATEngine command to
use to preprocess the corpora. You might use this command to take
documents which have been deidentified and resynthesize fillers for
the deidentified regions.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, task. The attributes should also not provide any other arguments which would further specify the input or output files. If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through. |
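
For the format-conversion case, a <prep> element might look like the sketch below. The file type name "xml-inline" is an assumption about how MATEngine names the XML inline format, and the workflow name, corpus name, and path are placeholders.

```xml
<corpora>
  <!-- Conversion only: no steps or undo_through attributes are given.
       "xml-inline" is an assumed file type name; check your MAT installation. -->
  <prep input_file_type="xml-inline" workflow="Demo"/>
  <corpus name="legacy">
    <pattern>/data/legacy/*.xml</pattern>
  </corpus>
</corpora>
```
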
To run experiments against workspaces, the <workspace_corpora> element
allows you to specify a collection of corpora to be drawn from a
workspace, restricted by various dimensions of the workspace contents.
Each <workspace_corpora> element establishes a context for a
workspace, by describing a subset of eligible workspace files, and
within that context you can define <workspace_corpus> elements which
are ultimately transformed into the same objects which the <corpus>
elements correspond to.

Each workspace corpus set must have a single trainable step in the
workspace's workflow which each document has reached. If the workflow
has more than one trainable step, you must specify the "step"
attribute.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
workspace_dir | a pathname | yes | The directory root of the workspace. |
step | a string | no | If present, a trainable step in the workflow associated with the workspace. Obligatory if the workspace has more than one trainable step. All documents which have at least reached this step will be included. |
step_statuses | a string | no | If present, a comma-separated list of step_statuses. The default is "partially corrected,partially gold,gold,reconciled". Any document whose current step is the target trainable step which doesn't have one of these statuses in the workspace will be excluded. |
users | a string | no | If present, a comma-separated list of workspace users. Any document which is assigned to a user which isn't one of these users will be excluded. |
include_unassigned | "no" | no | If present, documents which are not assigned to any workspace user will be excluded. |
basename_sets | a string | no | If present, a comma-separated list of workspace basename sets. Any document which is not in one of these basename sets in the given workspace will be excluded. |
basename_patterns | a string | no | If present, a comma-separated list of glob-style patterns to match the workspace basenames against. Any document whose basename does not match one of the patterns will be excluded. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<workspace_corpus> | yes | yes | An actual corpus which will be created in this workspace context. |
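
A workspace context restricted by step, status, and user might be sketched as follows. The workspace path, step name, user names, and corpus name are hypothetical; the two status values are taken from the default list above.

```xml
<!-- "tag", "alice" and "bob" are placeholders for a trainable step and
     workspace users in your own workspace. -->
<workspace_corpora workspace_dir="/data/my_workspace" step="tag"
                   step_statuses="gold,reconciled" users="alice,bob">
  <workspace_corpus name="ws_gold"/>
</workspace_corpora>
```
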
Each <workspace_corpus> element defines a set of files,
similar to a <corpus> element. Because its common context is
a workspace, we've chosen to allow the partition, crossvalidation
and size information to be specified independently for each of
these elements, rather than establishing them in the common
workspace context (as <corpora> does for <corpus>).
Within the workspace context, you can further restrict each corpus
in the same way as you restrict the context itself, and also
identify a special, unique corpus as containing the remainder of
the files in the workspace context. Note that you can't further
filter the workspace's workflow step; this is specifiable only in
the parent element.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built. |
step_statuses | a string | no | If present, a comma-separated list of step_statuses. Any document in the workspace context whose current step is the target trainable step which doesn't have one of these statuses will be excluded. |
users | a string | no | If present, a comma-separated list of workspace users. Any document in the workspace context which is assigned to a user which isn't one of these users will be excluded. |
include_unassigned | "no" | no | If present, documents in the workspace context which are not assigned to any workspace user will be excluded. |
basename_sets | a string | no | If present, a comma-separated list of workspace basename sets. Any document in the workspace context which is not in one of these basename sets will be excluded. |
basename_patterns | a string | no | If present, a comma-separated list of glob-style patterns to match the workspace basenames against. Any document in the workspace context whose basename does not match one of the patterns will be excluded. |
use_remainder | "yes" | no | If present, the corpus consists of all those documents in the document context which are not included in any sibling <workspace_corpus>. This attribute-value pair must occur without any of the other qualifying attributes. |
The semantics of the child elements listed below are identical to
their semantics in the scope of the <corpora> element.
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<partition> | no | yes | The proportional partition settings for this workspace corpus. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <workspace_corpus> element. |
<fixed_partition> | no | yes | The fixed partition settings for this workspace corpus. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <workspace_corpus> element. |
<crossvalidation> | no | no | The crossvalidation settings for this workspace corpus. If this element is present, the corpus will be prepared for crossvalidation. You cannot mix fixed and proportional partitions and crossvalidation within a given <workspace_corpus> element. |
<size> | no | no | The size settings for this workspace corpus. |
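
Because partition and crossvalidation settings are given per workspace corpus, sibling corpora can be prepared differently, as in this sketch. The workspace path, corpus names, basename pattern, and numbers are hypothetical.

```xml
<workspace_corpora workspace_dir="/data/my_workspace">
  <workspace_corpus name="ws_news" basename_patterns="news_*">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
  </workspace_corpus>
  <!-- use_remainder appears without any other qualifying attributes. -->
  <workspace_corpus name="ws_rest" use_remainder="yes">
    <crossvalidation folds="5"/>
  </workspace_corpus>
</workspace_corpora>
```
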
See <partition> of <corpora> for details.

See <fixed_partition> of <corpora> for details.

See <crossvalidation> of <corpora> for details.

See <size> of <corpora> for details.
Each experiment can also contain a number of model sets, grouped in
<model_sets> elements. A model set is a sequence of models built out
of the same corpus, with successively larger numbers of training
inputs. This iterative capability does not have to be used, but is
available if the user wants to track the change in performance
relative to the number of training documents.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<build_settings> | no | no | The instructions for building the model sets in this bundle. |
<model_set> | yes | yes | A model set. |
The <build_settings> element holds the instructions for building the
model sets. In order to run an experiment, you must have declared the
appropriate <model_config> in your task.xml file.
While the training engine you're most likely to use is the jCarafe
engine, your task may have multiple trainable steps and multiple
trainable engines, and you'll need to ensure that the proper one is
chosen. If you have multiple trainable steps, you must specify the
trainable step explicitly using the "trainable_step" attribute.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
config_name | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If this attribute is present, the experiment engine will look for the model build settings with the specified config_name. |
trainable_step | a string | no | The name of a step in your task.xml file which has an engine which is trainable. Specify this attribute if there are multiple trainable steps in your task. The model builder will infer the annotation sets to train for from this step. |
<attr> | a string | no | An attribute-value pair which overrides or sets the attribute values for your chosen model configuration. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<iterator> | no | yes | An iterator which can be applied to create a sequence of models for these model sets. |
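
A <build_settings> element that selects a trainable step and overrides one engine setting might look like this sketch. The step name "tag" and the setting "max_iterations" are hypothetical placeholders for names defined by your own task and training engine.

```xml
<!-- "tag" and "max_iterations" are illustrative; use a trainable step from
     your task.xml and an attribute your model configuration accepts. -->
<build_settings trainable_step="tag" max_iterations="50"/>
```
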
It is possible to iterate through a set of values for the model
builder using an iterator.
For instance, the default jCarafe
engine allows you to customize the degree of L1
regularization (see the jCarafe documentation for details). You
might want to build a series of models exploring the effects of L1
regularization values from 0.0 through 2.0, at increments of .2.
Or, you might want to vary the number of training iterations the
engine performs from 10 to 150 at increments of 10.
You can specify multiple iterators, and you'll get the
cross-product of the settings. The iterator mechanism is flexible
enough that you can build
iterators which depend on the last model built, or iterators
which specify their own model builder class (if, for example, they
need to do some extensive computation on the training corpus
before they train).
Corpus setting iterators and build setting iterators are both
applied to the model, and the cross-product of the possible values
is used. The corpus setting iterators are applied first.
The build settings support two built-in iterators, which can be
configured using the <iterator> element.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
type | a string | yes | Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library. |
<attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |
Here are the attributes and values available for the "value"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
values | a string | yes | A comma-delimited set of values to iterate over. |
value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |
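
A "value" iterator might be configured as in the sketch below, which would build one model per listed value. The attribute name "max_iterations" is a hypothetical engine setting.

```xml
<build_settings>
  <!-- value_type="int" so the values are not treated as strings (the default). -->
  <iterator type="value" attribute="max_iterations"
            values="10,50,100" value_type="int"/>
</build_settings>
```
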
Here are the attributes and values available for the "increment"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
start_val | an integer or float | yes | The initial value for the incrementer. |
end_val | an integer or float | yes | The final value for the incrementer. |
increment | an integer or float | yes | The increment to add to the value on each iteration. |
force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing model iterations from 20 to 150 by increments of 20, the last value processed will be 140, unless you provide this setting. This setting is also useful with float values, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
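
The 20-to-150 case from the table could be written as the following sketch, again with a hypothetical attribute name.

```xml
<build_settings>
  <!-- Without force_last, iteration would stop at 140. -->
  <iterator type="increment" attribute="max_iterations"
            start_val="20" end_val="150" increment="20" force_last="yes"/>
</build_settings>
```
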
In addition to the model builder itself, you can configure properties
of the training corpus, using the <corpus_settings> element. At the
moment, the only property of the training corpus you can configure is
its size, and this is mostly in service of the iterator over corpus
size. When you specify a size, the training corpus is truncated to
that number of documents.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
size | an integer | no | By default, the corpus size is defined in the <corpora> element. However, you can further specify the size here, if you want the corpus to be even smaller than what's specified in <corpora>, or if your training corpus is a union of a number of different corpora (see the <training_corpus> element below). |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<iterator> | no | yes | An iterator which can be applied to create a sequence of corpora for these model sets. |
It is possible to iterate through a set of values for the model
corpus using an iterator.
Right now, the only available iterator is the "corpus_size"
iterator (although you can also define
your own if you need to).
Corpus setting iterators and build setting iterators are both
applied to the model, and the cross-product of the possible values
is used. The corpus setting iterators are applied first.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
type | a string | yes | Either the predefined iterator type "corpus_size", or the name of an iterator class defined in your task's Python library. |
<attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the "corpus_size" iterator type are listed immediately below. |
Here are the attributes and values available for the
"corpus_size" iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
start_val | an integer | no | The initial value for the corpus size. Defaults to the increment. |
end_val | an integer | no | The final value for the corpus size. Defaults to the size of the corpus. |
increment | an integer | yes | The corpus size increment for each iteration. |
force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. So if the corpus has 176 documents in it, and you've specified an increment of 20, the last corpus size that will be processed is 160, unless this option is specified. |
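
A corpus size iterator relying on the defaults might be sketched as follows; the increment of 20 is illustrative.

```xml
<corpus_settings>
  <!-- start_val defaults to the increment and end_val to the corpus size;
       force_last ensures the final, partial increment is also processed. -->
  <iterator type="corpus_size" increment="20" force_last="yes"/>
</corpus_settings>
```
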
Each <model_set> element describes a single model set.

Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of this model set, for subsequent reference in this experiment. It is also used as the name of the subdirectory in which this model set is built. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<training_corpus> | yes | yes | One or more corpora (and possibly partitions of corpora) which should be used to construct this model set. |
The <training_corpus> element specifies the training corpora to use in
building this model set. It may be repeated. Each corpus is referred
to by name, with an optional partition name. Note that the model will
not use a document path more than once, so if you specify multiple
corpora, and the corpora overlap, only one instance of each path will
be used.

If the corpus is a crossvalidation corpus, there can be only one
training corpus, and its partitions may not be referred to. In
addition, you may not specify the corpus size in the corpus settings,
or use the corpus size iterator.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
corpus | a string | yes | The name of the training corpus to use. This name must match the "name" attribute of some <corpus> element in the experiment file. |
partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
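
A model set trained on one partition of one corpus plus the whole of another might be sketched like this. The model set, corpus, and partition names are hypothetical and must match names declared elsewhere in the same experiment file.

```xml
<model_sets>
  <model_set name="news_model">
    <!-- Only the "train" partition of "news" is used. -->
    <training_corpus corpus="news" partition="train"/>
    <!-- No partition attribute: the entire "blogs" corpus is used. -->
    <training_corpus corpus="blogs"/>
  </model_set>
</model_sets>
```
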
The experiment can also have a set of runs, grouped in <runs>
elements. The runs in each <runs> element share a set of run settings.
Whenever the experiment is run, each <run> is scored, whether or not
it's been scored before. This is a convenient way of reviewing the
scores after an experiment is finished.

The scores produced by the experiment engine reflect only the
annotations created, and attributes set, by the step which the model
for the run was built for. This is equivalent to passing the
annotation sets for that step as the --content_annotation_sets option
to MATScore.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
dir | a pathname | no | If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<run_settings> | yes | no | A container for the arguments to run the processing engine with. |
<run> | yes | yes | An experimental run. |
The <run_settings> element is a container for the arguments to run the processing engine with.

Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<args> | yes | no | The arguments to the MATEngine to use for these experiment runs. |
<prep_args> | no | no | The arguments to the MATEngine to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that. |
<score_args> | no | no | Flags which control the behavior of the scorer. |
The <args> element houses the arguments to the MATEngine command to
perform the experiment runs.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
The <prep_args> element houses the arguments to the MATEngine command used to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents; if you instead want to undo a step and leave them as MAT JSON documents, use this element.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
<attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. The output_file_type attribute must be specified (only mat-json and raw are allowed). The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, task. The attributes should also not provide any other arguments which would further specify the input or output files. |
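The "undo a step and keep MAT JSON" case might look like the following sketch. It assumes that undo_through corresponds to a MATEngine option of the same name, and that your task has a workflow called "Demo" with a step called "tag"; both names are placeholders rather than values taken from this document.

```xml
<run_settings>
  <args workflow="Demo" steps="tag"/>
  <!-- Assumption: undo_through mirrors the MATEngine option of that name. -->
  <!-- output_file_type is required here and must be mat-json or raw. -->
  <prep_args workflow="Demo" undo_through="tag" output_file_type="mat-json"/>
</run_settings>
```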
The <score_args> flags control the behavior of the scorer. Unlike the attributes of <prep_args> and <args>, they are not yet processed generically; for the moment, each flag is individually defined and handled.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
gold_only | "yes" | no | Equivalent to the --ref_gold_only option of MATScore. If this attribute is provided, only gold or reconciled segments in the reference will be used for scoring comparison. This is particularly useful when defining experiment files to be used with workspaces. |
similarity_profile | a string | no | Equivalent to the --similarity_profile option of MATScore. |
score_profile | a string | no | Equivalent to the --score_profile option of MATScore. |
score_steps | a comma-separated string | no | If no model attribute is provided to the <run> element below, you must provide a comma-separated list of task steps to inform the scorer which steps are being evaluated. |
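A short sketch of <score_args> using the attributes listed above. The step name "tag" and the workflow name "Demo" are placeholders; score_steps is shown because it is needed for runs that name no model.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- gold_only restricts scoring to gold or reconciled reference segments. -->
  <!-- score_steps is needed only when a run has no model attribute. -->
  <score_args gold_only="yes" score_steps="tag"/>
</run_settings>
```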
It is possible to iterate through a set of values for the run
using an iterator. For instance, the default jCarafe engine
allows you to customize the recall/precision bias (see the jCarafe
documentation for details). You might want to explore the effects
of recall/precision bias values from -2.0 through 2.0, at
increments of 0.5.
You can specify multiple iterators, and you'll get the
cross-product of the settings. The iterator mechanism is flexible
enough that you can build your own iterators if you need them.
The run settings support two built-in iterators, which can be
configured using the <iterator> element.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
type | a string | yes | Either one of the two predefined iterator types, "value" or "increment", or the name of an iterator class defined in your task's Python library. |
<attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |
Here are the attributes and values available for the "value"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element in workflow steps in task.xml. |
values | a string | yes | A comma-delimited set of values to iterate over. |
value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |
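A "value" iterator might be written as in the sketch below. The attribute name prior_adjust is an assumption (it is taken to be the setting behind the jCarafe recall/precision bias), and the three values are arbitrary illustrations.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- Assumed attribute name; value_type makes the values floats, not strings. -->
  <iterator type="value" attribute="prior_adjust"
            values="-1.0,0.0,1.0" value_type="float"/>
</run_settings>
```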
Here are the attributes and values available for the "increment"
iterator:
Attribute | Value | Obligatory? | Description |
---|---|---|---|
attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element in workflow steps in task.xml. |
start_val | an integer or float | yes | The initial value for the incrementer. |
end_val | an integer or float | yes | The final value for the incrementer. |
increment | an integer or float | yes | The increment to add to the value on each iteration. |
force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing recall/precision bias from -2.0 to 2.2 by increments of .5, the last value processed will be 2.0, unless you provide this setting. This setting is also useful with float values even if it appears that the endpoints match the increment precisely, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
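The recall/precision bias sweep from -2.0 through 2.0 mentioned earlier could be sketched with an "increment" iterator as follows. The attribute name prior_adjust is an assumption, and hypothetical_option is an invented name used only to show a second iterator. With both present you would nominally get the cross-product: nine bias values times two option values, or 18 combinations of settings.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- Assumed attribute name. force_last guards against float drift at end_val. -->
  <iterator type="increment" attribute="prior_adjust"
            start_val="-2.0" end_val="2.0" increment="0.5" force_last="yes"/>
  <!-- hypothetical_option is an invented attribute, for illustration only. -->
  <iterator type="value" attribute="hypothetical_option" values="a,b"/>
</run_settings>
```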
The <run> element defines a single experimental run.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
name | a string | yes | The name of this experimental run. This is used as the name of the subdirectory in which this run is conducted. |
model | a string | no | A comma-separated sequence of names of models to use. Each name must match the "name" value of some <model_set> element in the experiment file. If there are multiple names, the corresponding model sets must all have been trained on the same corpus and partition. If no models are provided, then the score_steps attribute of <score_args> must be specified in order to inform the scorer which steps are being evaluated. |
Element | Obligatory? | Repeatable? | Description |
---|---|---|---|
<test_corpus> | no | yes | One or more test corpora (and possibly partitions of corpora) to use in this run. This element is required unless the models are crossvalidation models, in which case it is forbidden. |
The <test_corpus> element names a test corpus (and possibly a
partition of that corpus) to use in this run. It may be repeated.
Attribute | Value | Obligatory? | Description |
---|---|---|---|
corpus | a string | yes | The name of the test corpus to use. This string must match the "name" value of some <corpus> element in the experiment file. |
partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
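Putting the pieces together, a complete <runs> block might look like the sketch below. All names are hypothetical: "ne_model" would have to match the name of a <model_set> defined elsewhere in the same experiment file, "ne" the name of a <corpus>, and "test" the name of a <partition> in that corpus. "Demo" and the step names are placeholders for your task's own.

```xml
<runs dir="runs">
  <run_settings>
    <args workflow="Demo" steps="zone,tokenize,tag"/>
    <score_args gold_only="yes"/>
  </run_settings>
  <!-- The run is conducted in a subdirectory named after it. -->
  <run name="baseline_run" model="ne_model">
    <!-- Omit partition to test on the entire corpus. -->
    <test_corpus corpus="ne" partition="test"/>
  </run>
</runs>
```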