Experiment engine XML reference

This document describes the XML format for experiment files (see MATExperimentEngine). Use cases are described separately.

Element hierarchy

<experiment>
    <binding>
    <corpora>
        <partition>
        <fixed_partition>
        <crossvalidation>
        <size>
        <corpus>
            <pattern>
        <prep>
    <workspace_corpora>
        <workspace_corpus>
            <partition>
            <size>
            <fixed_partition>
            <crossvalidation>
    <model_sets>
        <build_settings>
            <iterator>
        <corpus_settings>
            <iterator>
        <model_set>
            <training_corpus>
    <runs>
        <run_settings>
            <prep_args>
            <score_args>
            <args>
            <iterator>
        <run>
            <test_corpus>

<experiment>

The toplevel element in the file. Note that none of the five child elements are obligatory; the experiment XML can be used simply to build corpora, or to build models, without performing any experimental runs, if, for instance, you want to build a model or corpus to be used in multiple experiments.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| dir | a pathname | no | The directory in which the experiment will be conducted. If the directory does not exist, it will be created. If not specified, the directory must be provided when the experiment is run. |
| task | a string | yes | The name of a task, as would be passed to the --task argument of MATEngine. This setting is used to establish the task for the corpus preparation and for the experiment runs, and also to establish the set of available tags for the training. |
| language | a language name or code | no | A language name or code as defined in the task. Obligatory if the task supports multiple languages and any of the steps executed in the experiment, or any of the model builders, vary according to the language. The language can also be provided by the MATExperimentEngine. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <binding> | no | yes | Bindings to be made globally available in the other elements. |
| <corpora> | no | yes | The corpora to be used in the experiment. |
| <workspace_corpora> | no | yes | The corpora to be used in the experiment that will be drawn from workspaces. |
| <model_sets> | no | yes | The model sets to be used in the experiment. |
| <runs> | no | yes | The experimental runs to be used in the experiment. |

<binding> (of <experiment>)

This element allows the user to define global bindings which can be referred to in any other element of the experiment XML file (except the attributes of the <experiment> element itself, and the <binding> elements). These bindings can be referred to either in XML attributes or in text within XML elements. The pattern for each binding is $(...). The experiment directory, whether provided via the dir attribute of the <experiment> element or on the command line, is provided as EXP_DIR; the pattern directory, if provided by the --pattern_dir command line argument to MATExperimentEngine, is provided as PATTERN_DIR.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| name | a string | yes | The binding to be replaced. The engine will look for $(<name>) anywhere in the attribute values or text in the experiment XML file. |
| value | a string | yes | The value to replace $(<name>) with. This replacement is not recursive; that is, you should not include any $(<name>) substrings in your value, unless you want them to be included literally, because they will not be expanded. |
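
As an illustration, the following sketch defines one binding and refers to it in a pattern. The task name, the binding name DATA, and all paths are hypothetical and are not taken from any real task.

```xml
<!-- Hypothetical task name and paths, for illustration only. -->
<experiment task="My Task">
  <!-- Makes $(DATA) available in later attribute values and element text. -->
  <binding name="DATA" value="/data/my_project"/>
  <corpora>
    <corpus name="all_docs">
      <!-- Expected to expand to /data/my_project/json/*.json. -->
      <pattern>$(DATA)/json/*.json</pattern>
    </corpus>
  </corpora>
</experiment>
```

Because the replacement is not recursive, a binding value that itself contains a $(...) reference would be kept literally rather than expanded.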

<corpora> (of <experiment>)

Describes corpora to be used in the experiment. This element may be repeated; the intention is that a single <corpora> element will correspond to a shared set of preprocessing instructions.

The corpora may be local, in which case a set of patterns should be provided, or remote, in which case the source_corpus_dir attribute should be provided. Remote corpora are used directly unless one or more of the processing tags are specified (<partition>, <fixed_partition>, <prep>, <crossvalidation>). In this case, the specified processing steps are added or redone locally, on a separate copy of the corpus.

For instance, if the remote corpus is split into test and train, but not preprocessed, and the <prep> tag is specified here, the corpus documents will be preprocessed here, and the remote split will be preserved. If the remote corpus is preprocessed and split, but the local <partition> tag specifies that the corpus type is "train", the remote corpus preprocessing will be preserved, but locally the split will be ignored. If the remote corpus contains enough patterns for 300 documents, but max_size remotely is 100 and max_size locally is 200, the local max_size will be used; this is possible because all the documents are preprocessed by default when a corpus is prepared, regardless of max_size, and the order of documents (after an initial randomization) is preserved from remote corpus to local copy.

Note that inside the experiment engine, MAT uses the MAT JSON document format exclusively. Therefore, if you want to provide documents which are in a different format which MAT also understands (e.g., XML inline), you must use the <prep> tag to convert the documents to MAT JSON format.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| dir | a pathname | no | The directory where the corpora are found, or should be built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "corpora" in the experiment directory. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <partition> | no | yes | The proportional partition settings for this group of corpora. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <corpora> element. |
| <fixed_partition> | no | yes | The fixed partition settings for this group of corpora. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <corpora> element. |
| <crossvalidation> | no | no | The crossvalidation settings for this group of corpora. If this element is present, the corpora will be prepared for crossvalidation. You cannot mix fixed and proportional partitions and crossvalidation within a given <corpora> element. |
| <size> | no | no | The size settings for this group of corpora. If the source_corpus_dir attribute of any of the sister <corpus> nodes is set, the values for <size> override those in the source corpus (i.e., a new max_size for the corpus might be established). |
| <corpus> | yes | yes | The individual corpora in this group. |
| <prep> | no | no | The arguments to the MATEngine command to use to preprocess the corpora. For instance, this command might take documents which have been deidentified and resynthesize fillers for the deidentified regions. |

The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, and task. The attributes should also not provide any other arguments which would further specify the input or output files.

If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through.

<partition> (of <corpora>)

Specifies a proportional partition of the sister corpora specified with the <corpus> tag. May be repeated. If there are no instances of this element or <fixed_partition>, the corpus has no partitions. The proportional partitions segment the entire corpus, so the fraction values are normalized to shares of the corpus. If you want just a 10th of the corpus, for instance, you must divide the corpus into two partitions at a ratio of 9:1 and ignore the larger slice.

Each <corpora> element may have either proportional or fixed partitions, or be designated as a crossvalidation corpus, but not more than one of these.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| name | a string | yes | The name of the partition. |
| fraction | a float | yes | The share of each corpus that should be allotted to this partition (a float between 0 and 1). |
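
The "tenth of the corpus" case described above might be written as follows. This is a sketch only; the corpus name, partition names, and path are hypothetical.

```xml
<corpora>
  <!-- The two fractions together cover the whole corpus at a ratio of 9:1. -->
  <partition name="unused" fraction="0.9"/>
  <partition name="sample" fraction="0.1"/>
  <corpus name="news">
    <!-- Hypothetical absolute path to documents already in MAT JSON format. -->
    <pattern>/data/my_project/json/*.json</pattern>
  </corpus>
</corpora>
```

Later elements would then refer only to the "sample" partition and ignore "unused".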

<fixed_partition> (of <corpora>)

Specifies a fixed partition of the sister corpora specified with the <corpus> tag. May be repeated. If there are no instances of this element or <partition>, the corpus has no partitions. These partitions do not segment the entire corpus. If you want a fixed partition to encompass "everything else", use the special "remainder" value as described below.

Each <corpora> element may have either proportional or fixed partitions, or be designated as a crossvalidation corpus, but not more than one of these.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| name | a string | yes | The name of the partition. |
| size | an integer or the string "remainder" | yes | The portion of each corpus that should be allotted to this partition (either an integer number of documents or the string "remainder"). The "remainder" value can only be used once in a <corpora> element. |
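
A minimal sketch of fixed partitions, with hypothetical names and path: 50 documents are reserved for testing, and everything else goes to training.

```xml
<corpora>
  <!-- Exactly 50 documents from each corpus. -->
  <fixed_partition name="test" size="50"/>
  <!-- "remainder" may appear only once per <corpora> element. -->
  <fixed_partition name="train" size="remainder"/>
  <corpus name="news">
    <!-- Hypothetical absolute path. -->
    <pattern>/data/my_project/json/*.json</pattern>
  </corpus>
</corpora>
```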

<crossvalidation> (of <corpora>)

Indicates that the corpus is a crossvalidation corpus, and allows the specification of the number of crossvalidation folds.

A crossvalidation corpus is used by the experiment engine for crossvalidation. Internally, it is split into <n> partitions, where <n> is the number of folds. A model which is trained against a crossvalidation corpus actually generates <n> models, where the training data for each model <i> is the crossvalidation corpus with partition <i> removed. When the model is used in a run, the run must also be applied to the crossvalidation corpus, and each model <i> is applied to partition <i>. The aggregated results are treated as a single run.

Each <corpora> element may have either proportional or fixed partitions, or be designated as a crossvalidation corpus, but not more than one of these.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| folds | an integer | yes | The number of crossvalidation folds. |
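
For example, a five-fold crossvalidation corpus could be declared along these lines (corpus name and path are hypothetical):

```xml
<corpora>
  <!-- Five folds: by the description above, a model set trained on this
       corpus produces five models, each omitting one of the partitions. -->
  <crossvalidation folds="5"/>
  <corpus name="cv_docs">
    <!-- Hypothetical absolute path. -->
    <pattern>/data/my_project/json/*.json</pattern>
  </corpus>
</corpora>
```

Note that this sketch contains no <partition>, <fixed_partition>, or <size> element, since those cannot be combined with <crossvalidation>.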

<size> (of <corpora>)

Specifies the size properties of the sister corpora specified with the <corpus> tag. If this tag is present, and the sister corpus has the source_corpus_dir attribute set, the specified values will override those in the source corpus.

This element is not compatible with <crossvalidation>.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| max_size | an integer | no | The maximum number of documents in each corpus. If specified, each corpus will not exceed this number. This limit is applied last, so the corpus can be reused with a greater max_size specified if the requisite number of documents are available. |
| truncate_document_list | "yes" | no | If present and max_size is also present, the max_size limit will be imposed first, rather than last. The consequence of this is that no more than max_size documents will be available to remote accesses of this corpus. |

<corpus> (of <corpora>)

A corpus is specified either by a set of patterns, or by a reference to another corpus (via source_corpus_dir). The documents specified by a set of patterns are randomly reordered before any subsequent processing is performed (e.g., split, preprocess). If source_corpus_dir is present, patterns are ignored.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built, if the corpus is either local (i.e., it has patterns), or remote with processing overrides. |
| source_corpus_dir | a string | no | If present, a pathname of an existing corpus directory. If the path is not an absolute path, the experiment directory will be prepended. |

The corpus found in this directory will be used as the input to further local processing. If present, the <pattern> children are ignored. Source corpora can themselves have source_corpus_dir attributes; in other words, you can create chains of source corpora. If the current corpus is in a <corpora> tag that has a <prep> tag, the local <prep> tag command line will be applied to the output of the source corpus (so you can chain prep commands if you want). The most local <partition> attributes will be used (that is, the attributes closest to this corpus in the source corpus chain).

Since corpora are created and loaded in the order they're listed in an experiment file, you can use source_corpus_dir to point to a corpus in the same experiment file. The path would be [experiment_dir]/corpora/[corpus_name], if the "dir" attribute is not set on the <corpora> tag which dominates the corpus you're referring to; if it is, the path would be [corpora_dir_attribute_value]/[corpus_name].

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <pattern> | yes | yes | A glob-style pattern of files to use to construct this corpus. "Glob" style is the UNIX shell file pattern matching; e.g., "*" matches everything. (This is in contrast to standard regular expressions.) If this path pattern isn't an absolute path, the --pattern_dir option of MATExperimentEngine must be used to provide the location of the patterns. |

This element has no attributes or element children; its value is the text it delimits.
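
The following sketch shows one corpus built from patterns and a second corpus that uses the first as its source, within the same experiment file. The task name, corpus names, and path are hypothetical. It assumes neither <corpora> element sets the dir attribute, so the first corpus is built under the "corpora" subdirectory of the experiment directory.

```xml
<experiment task="My Task">
  <corpora>
    <size max_size="100"/>
    <corpus name="base">
      <!-- Hypothetical absolute path. -->
      <pattern>/data/my_project/json/*.json</pattern>
    </corpus>
  </corpora>
  <corpora>
    <!-- Local settings override those of the source corpus. -->
    <size max_size="200"/>
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
    <!-- Relative path, so the experiment directory is prepended. -->
    <corpus name="derived" source_corpus_dir="corpora/base"/>
  </corpora>
</experiment>
```

The ordering matters here: "base" is listed first, so it exists by the time "derived" is created.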

<prep> (of <corpora>)

This element houses the arguments to the MATEngine command to use to preprocess the corpora. You might use this command to take documents which have been deidentified and resynthesize fillers for the deidentified regions.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| <attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. |

The input_file_type attribute is not provided automatically and must be provided as one of the attributes for this element. The workflow attribute must also be specified. The following attributes are provided by the experiment engine and should not be specified in <prep>: output_file_type, input_dir, output_dir, and task. The attributes should also not provide any other arguments which would further specify the input or output files.

If your documents are not in MAT JSON format, but another format that MAT understands (e.g., XML inline), insert a <prep> specification which specifies the input_file_type attribute and omits any attribute values for steps or undo_through.
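
A conversion-only <prep> might look like the sketch below. The workflow name "Demo", the corpus name, and the path are hypothetical, and "xml-inline" is assumed here to be the file type identifier for XML inline documents; check your MAT installation for the actual value.

```xml
<corpora>
  <!-- Only workflow and input_file_type are given; steps and undo_through
       are omitted, so the documents are converted and nothing more. -->
  <prep workflow="Demo" input_file_type="xml-inline"/>
  <corpus name="converted">
    <!-- Hypothetical absolute path to XML inline documents. -->
    <pattern>/data/my_project/xml/*.xml</pattern>
  </corpus>
</corpora>
```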

<workspace_corpora> (of <experiment>)

To support experiments against workspaces, the <workspace_corpora> element allows you to specify a collection of corpora to be drawn from a workspace, restricted by various dimensions of the workspace contents. Each <workspace_corpora> element establishes a context for a workspace, by describing a subset of eligible workspace files, and within that context you can define <workspace_corpus> elements which are ultimately transformed into the same objects which the <corpus> elements correspond to.

Each workspace corpus set must have a single trainable step in the workspace's workflow which each document has reached. If the workflow has more than one trainable step, you must specify the "step" attribute.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |
| workspace_dir | a pathname | yes | The directory root of the workspace. |
| step | a string | no | If present, a trainable step in the workflow associated with the workspace. Obligatory if the workspace has more than one trainable step. All documents which have at least reached this step will be included. |
| step_statuses | a string | no | If present, a comma-separated list of step_statuses. The default is "partially corrected,partially gold,gold,reconciled". Any document whose current step is the target trainable step which doesn't have one of these statuses in the workspace will be excluded. |
| users | a string | no | If present, a comma-separated list of workspace users. Any document which is assigned to a user which isn't one of these users will be excluded. |
| include_unassigned | "no" | no | If present, documents which are not assigned to any workspace user will be excluded. |
| basename_sets | a string | no | If present, a comma-separated list of workspace basename sets. Any document which is not in one of these basename sets in the given workspace will be excluded. |
| basename_patterns | a string | no | If present, a comma-separated list of glob-style patterns to match the workspace basenames against. Any document whose basename does not match one of the patterns will be excluded. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <workspace_corpus> | yes | yes | An actual corpus which will be created in this workspace context. |

<workspace_corpus> (of <workspace_corpora>)

Each <workspace_corpus> element defines a set of files, similar to a <corpus> element. Because its common context is a workspace, we've chosen to allow the partition, crossvalidation and size information to be specified independently for each of these elements, rather than establishing them in the common workspace context (as <corpora> does for <corpus>). Within the workspace context, you can further restrict each corpus in the same way as you restrict the context itself, and also identify a special, unique corpus as containing the remainder of the files in the workspace context. Note that you can't further filter the workspace's workflow step; this is specifiable only in the parent element.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| name | a string | yes | The name of the corpus, for subsequent reference in the remainder of the experiment. It is also used as the name of the subdirectory in which this corpus is built. |
| step_statuses | a string | no | If present, a comma-separated list of step statuses. Any document in the workspace context whose current step is the target trainable step which doesn't have one of these statuses will be excluded. |
| users | a string | no | If present, a comma-separated list of workspace users. Any document in the workspace context which is assigned to a user which isn't one of these users will be excluded. |
| include_unassigned | "no" | no | If present, documents in the workspace context which are not assigned to any workspace user will be excluded. |
| basename_sets | a string | no | If present, a comma-separated list of workspace basename sets. Any document in the workspace context which is not in one of these basename sets will be excluded. |
| basename_patterns | a string | no | If present, a comma-separated list of glob-style patterns to match the workspace basenames against. Any document in the workspace context whose basename does not match one of the patterns will be excluded. |
| use_remainder | "yes" | no | If present, the corpus consists of all those documents in the workspace context which are not included in any sibling <workspace_corpus>. This attribute-value pair must occur without any of the other qualifying attributes. |

Children

The semantics of these elements are identical to the semantics of these elements in the scope of the <corpora> element.

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <partition> | no | yes | The proportional partition settings for this workspace corpus. If neither this nor <fixed_partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <workspace_corpus> element. |
| <fixed_partition> | no | yes | The fixed partition settings for this workspace corpus. If neither this nor <partition> is present, the corpus will not have any partitions. You cannot mix fixed and proportional partitions and crossvalidation within a given <workspace_corpus> element. |
| <crossvalidation> | no | no | The crossvalidation settings for this workspace corpus. If this element is present, the corpus will be prepared for crossvalidation. You cannot mix fixed and proportional partitions and crossvalidation within a given <workspace_corpus> element. |
| <size> | no | no | The size settings for this workspace corpus. |

<partition> (of <workspace_corpus>)

See <partition> of <corpora> for details.

<fixed_partition> (of <workspace_corpus>)

See <fixed_partition> of <corpora> for details.

<crossvalidation> (of <workspace_corpus>)

See <crossvalidation> of <corpora> for details.

<size> (of <workspace_corpus>)

See <size> of <corpora> for details.
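
Putting the workspace elements together, a sketch might look like this. The workspace path, user name, and corpus names are hypothetical, and the workspace is assumed to have a single trainable step, so the step attribute is omitted.

```xml
<!-- Context: documents in the workspace whose status is gold or reconciled. -->
<workspace_corpora workspace_dir="/workspaces/my_ws"
                   step_statuses="gold,reconciled">
  <!-- Documents in the context assigned to the hypothetical user "alice",
       with a per-corpus proportional split. -->
  <workspace_corpus name="ws_alice" users="alice">
    <partition name="train" fraction="0.8"/>
    <partition name="test" fraction="0.2"/>
  </workspace_corpus>
  <!-- Everything in the context not claimed by a sibling corpus;
       no other qualifying attributes are given. -->
  <workspace_corpus name="ws_rest" use_remainder="yes"/>
</workspace_corpora>
```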

<model_sets> (of <experiment>)

Each experiment also can contain a number of model sets. A model set is a sequence of models built out of the same corpus, with successively larger numbers of training inputs. This iterative capability does not have to be used, but is available if the user wants to track the change in performance relative to the number of training documents.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| dir | a pathname | no | If present, the directory where the model sets can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "model_sets" in the experiment directory. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <build_settings> | no | no | The instructions for building the model sets in this bundle. |
| <model_set> | yes | yes | A model set. |

<build_settings> (of <model_sets>)

In order to run an experiment, you must have declared the appropriate <model_config> in your task.xml file.

While the training engine you're most likely to use is the jCarafe engine, your task may have multiple trainable steps and multiple trainable engines, and you'll need to ensure that the proper one is chosen. If you have multiple trainable steps, you must specify the trainable step explicitly using the "trainable_step" attribute.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| config_name | a string | no | By default, the settings here will override the attribute values for the default model build settings in task.xml. If this attribute is present, the experiment engine will look for the model build settings with the specified config_name. |
| trainable_step | a string | no | The name of a step in your task.xml file which has an engine which is trainable. Specify this attribute if there are multiple trainable steps in your task. The model builder will infer the annotation sets to train for from this step. |
| <attr> | a string | no | An attribute-value pair which overrides or sets the attribute values for your chosen model configuration. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <iterator> | no | yes | An iterator which can be applied to create a sequence of models for these model sets. |

<iterator> (of <build_settings>)

It is possible to iterate through a set of values for the model builder using an iterator. For instance, the default jCarafe engine allows you to customize the degree of L1 regularization (see the jCarafe documentation for details). You might want to build a series of models exploring the effects of L1 regularization values from 0.0 through 2.0, at increments of .2. Or, you might want to vary the number of training iterations the engine performs from 10 to 150 at increments of 10.

You can specify multiple iterators, and you'll get the cross-product of the settings. The iterator mechanism is flexible enough that you can build iterators which depend on the last model built, or iterators which specify their own model builder class (if, for example, they need to do some extensive computation on the training corpus before they train).

Corpus setting iterators and build setting iterators are both applied to the model, and the cross-product of the possible values is used. The corpus setting iterators are applied first.

The build settings support two built-in iterators, which can be configured using the <iterator> element.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| type | a string | yes | Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library. |
| <attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below. |

Here are the attributes and values available for the "value" iterator:

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
| values | a string | yes | A comma-delimited set of values to iterate over. |
| value_type | one of "float", "str", "int" | no | The type of the values, either strings, integers, or floats. Default is "str" (string). |

Here are the attributes and values available for the "increment" iterator:

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| attribute | a string | yes | The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <build_settings> element above or in the <model_config> element in task.xml. |
| start_val | an integer or float | yes | The initial value for the incrementer. |
| end_val | an integer or float | yes | The final value for the incrementer. |
| increment | an integer or float | yes | The increment to add to the value on each iteration. |
| force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing model iterations from 20 to 150 by increments of 20, the last value processed will be 140, unless you provide this setting. This setting is also useful with float values, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed. |
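
As a sketch of the two predefined iterators used together: the engine attribute names "max_iterations" and "feature_set", and the values given for them, are hypothetical placeholders for whatever attributes your training engine actually accepts.

```xml
<build_settings>
  <!-- Steps 20, 40, ..., 140; force_last adds the end value 150. -->
  <iterator type="increment" attribute="max_iterations"
            start_val="20" end_val="150" increment="20" force_last="yes"/>
  <!-- Two explicit string values. -->
  <iterator type="value" attribute="feature_set"
            values="basic,extended" value_type="str"/>
</build_settings>
```

Since multiple iterators yield the cross-product of their settings, this configuration would be expected to produce 8 x 2 = 16 models per model set.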

<corpus_settings> (of <model_sets>)

In addition to the model builder itself, you can configure properties of the training corpus. At the moment, the only configurable property of the training corpus is its size, and this is mostly in service of the iterator over corpus size. When you specify a size, the training corpus is truncated to the specified length.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| size | an integer | no | By default, the corpus size is defined in the <corpora> element. However, you can further specify the size here, if you want the corpus to be even smaller than what's specified in <corpora>, or if your training corpus is a union of a number of different corpora (see the <training_corpus> element below). |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <iterator> | no | yes | An iterator which can be applied to create a sequence of corpora for these model sets. |

<iterator> (of <corpus_settings>)

It is possible to iterate through a set of values for the model corpus using an iterator. Right now, the only available iterator is the "corpus_size" iterator (although you can also define your own if you need to).

Corpus setting iterators and build setting iterators are both applied to the model, and the cross-product of the possible values is used. The corpus setting iterators are applied first.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| type | a string | yes | Either the predefined iterator type "corpus_size", or the name of an iterator class defined in your task's Python library. |
| <attr> | a string | no | An attribute-value pair which configures the given iterator. The available attributes and values for the "corpus_size" iterator type are listed immediately below. |

Here are the attributes and values available for the "corpus_size" iterator:

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| start_val | an integer | no | The initial value for the corpus size. Defaults to the increment. |
| end_val | an integer | no | The final value for the corpus size. Defaults to the size of the corpus. |
| increment | an integer | yes | The corpus size increment for each iteration. |
| force_last | "yes" | no | If present, force the last value to be processed, even if it's not exactly an increment. So if the corpus has 176 documents in it, and you've specified an increment of 20, the last corpus size that will be processed is 160, unless this option is specified. |
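
A learning-curve configuration could be sketched as follows. The model set and corpus names are hypothetical, and the placement of <corpus_settings> directly under <model_sets> follows the element hierarchy at the top of this document.

```xml
<model_sets>
  <corpus_settings>
    <!-- start_val defaults to the increment, end_val to the corpus size. -->
    <iterator type="corpus_size" increment="20" force_last="yes"/>
  </corpus_settings>
  <model_set name="learning_curve">
    <!-- Refers to a hypothetical corpus "news" with a "train" partition. -->
    <training_corpus corpus="news" partition="train"/>
  </model_set>
</model_sets>
```

With a 176-document training corpus, this would be expected to build models at sizes 20, 40, and so on up to 160, plus a final model at 176 because force_last is set.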

<model_set> (of <model_sets>)

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| name | a string | yes | The name of this model set, for subsequent reference in this experiment. It is also used as the name of the subdirectory in which this model set is built. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <training_corpus> | yes | yes | One or more corpora (and possibly partitions of corpora) which should be used to construct this model set. |

<training_corpus> (of <model_set>)

May be repeated. Specifies the training corpora to use in building this model set. Each corpus is referred to by name, and an optional partition name. Note that the model will not use a document path more than once, so if you specify multiple corpora, and the corpora overlap, only one instance of each path will be used.

If the corpus is a crossvalidation corpus, there can be only one training corpus, and its partitions may not be referred to. In addition, you may not specify the corpus size in the corpus settings, or use the corpus size iterator.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| corpus | a string | yes | The name of the training corpus to use. This name must match the "name" attribute of some <corpus> element in the experiment file. |
| partition | a string | no | If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used. |
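
For instance, a model set trained on the union of one partition and one whole corpus might be declared like this (all names are hypothetical and must match corpora and partitions defined elsewhere in the same experiment file):

```xml
<model_set name="combined">
  <!-- Only the "train" partition of the corpus "news". -->
  <training_corpus corpus="news" partition="train"/>
  <!-- The entire corpus "blogs", since no partition is named. -->
  <training_corpus corpus="blogs"/>
</model_set>
```

If the two corpora share document paths, each path is used only once.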

<runs> (of <experiment>)

The experiment also can have a set of runs. The runs in each <runs> element share a set of run settings. Whenever the experiment is run, each <run> is scored, whether or not it's been scored before. This is a convenient way of reviewing the scores after an experiment is finished.

The scores produced by the experiment engine reflect only the annotations created, and attributes set, by the step which the model for the run was built for. It is equivalent to passing the annotation sets for that step as the --content_annotation_sets option to MATScore.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| dir | a pathname | no | If present, the directory where the runs can be found or built. If the directory does not exist, it will be created. The default value for this attribute is a subdirectory named "runs" in the experiment directory. |

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <run_settings> | yes | no | A container for the arguments to run the processing engine with. |
| <run> | yes | yes | An experimental run. |

<run_settings> (of <runs>)

Children

| Element | Obligatory? | Repeatable? | Description |
| --- | --- | --- | --- |
| <args> | yes | no | The arguments to the MATEngine to use for these experiment runs. |
| <prep_args> | no | no | The arguments to the MATEngine to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that. |
| <score_args> | no | no | Flags which control the behavior of the scorer. |

<args> (of <run_settings>)

This element houses the arguments to the MATEngine command to perform the experiment runs.

Attributes

| Attribute | Value | Obligatory? | Description |
| --- | --- | --- | --- |
| <attr> | a string | no | An attribute-value pair which corresponds to a command-line option to MATEngine. |

The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <args>: input_file_type, output_file_type, input_dir, output_dir, output_fsuff, and task. The attributes should also not provide any other arguments which would further specify the input or output files.

<prep_args> (of <run_settings>)

This element houses the arguments to the MATEngine command to use to prepare the annotated documents for the experiment runs. By default, the documents are converted to raw documents, but if instead you want to just undo a step and leave them as MAT JSON documents, you can use this element to achieve that.

Attributes

Attribute
Value
Obligatory?
Description
<attr>
a string
no
An attribute-value pair which corresponds to a command-line option to MATEngine.

The output_file_type attribute must be specified (you're restricted to mat-json and raw).

The workflow attribute must be specified. The following attributes are provided by the experiment engine and should not be specified in <prep_args>: input_file_type, input_dir, output_dir, output_fsuff, task. Nor should <prep_args> provide any other arguments which would further specify the input or output files.
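
The following fragment sketches the "undo a step and keep MAT JSON" case described above. It rests on assumptions: the workflow name "Demo" and the step name "tag" are placeholders, and undo_through is assumed to be the name of the MATEngine option that undoes a step; check the MATEngine documentation for the exact option before relying on it.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- Keep the prepared documents as MAT JSON instead of converting them to raw.
       undo_through is an assumed option name; output_file_type and workflow are required. -->
  <prep_args workflow="Demo" undo_through="tag" output_file_type="mat-json"/>
</run_settings>
```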

<score_args> (of <run_settings>)

These flags control the behavior of the scorer. They are not yet processed generically, the way <prep_args> and <args> are; for the moment, each flag is individually defined and handled.

Attributes

Attribute
Value
Obligatory?
Description
gold_only
"yes"
no
Equivalent to the --ref_gold_only option of MATScore. If this attribute is provided, only gold or reconciled segments in the reference will be used for scoring comparison. This is particularly useful when defining experiment files to be used with workspaces.
similarity_profile
a string
no
Equivalent to the --similarity_profile option of MATScore.
score_profile
a string
no
Equivalent to the --score_profile option of MATScore.
score_steps
a comma-separated string
no
If no model attribute is provided to the <run> element below, you must provide a comma-separated list of task steps to inform the scorer which steps are being evaluated.
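
As a sketch of how these flags might be combined, the fragment below restricts scoring to gold or reconciled reference segments and names the evaluated step explicitly. The workflow name "Demo" and the step names are placeholders for this example.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- gold_only takes the value "yes"; score_steps is needed when a run
       has no model attribute, so the scorer knows which steps to evaluate -->
  <score_args gold_only="yes" score_steps="tag"/>
</run_settings>
```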

<iterator> (of <run_settings>)

It is possible to iterate through a set of values for the run using an iterator. For instance, the default jCarafe engine allows you to customize the recall/precision bias (see the jCarafe documentation for details). You might want to build a series of models exploring the effects of recall/precision bias values from -2.0 through 2.0, at increments of .5.

You can specify multiple iterators, and you'll get the cross-product of the settings. The iterator mechanism is flexible enough that you can build your own iterators if you need them.

The run settings support two built-in iterators, which can be configured using the <iterator> element.

Attributes

Attribute
Value
Obligatory?
Description
type
a string
yes
Either one of the two predefined iterator types "value" or "increment", or the name of an iterator class defined in your task's Python library.
<attr>
a string
no
An attribute-value pair which configures the given iterator. The available attributes and values for the two predefined iterator types are listed immediately below.

Here are the attributes and values available for the "value" iterator:

Attribute
Value
Obligatory?
Description
attribute
a string
yes
The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element in workflow steps in task.xml.
values
a string
yes
A comma-delimited set of values to iterate over.
value_type
one of "float", "str", "int"
no
The type of the values, either strings, integers, or floats. Default is "str" (string).
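
A "value" iterator configured with these attributes could be sketched as follows. The attribute name prior_adjust is an assumption standing in for whatever setting controls recall/precision bias in your engine, and the workflow and step names are placeholders.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- Try three explicit settings; value_type="float" overrides the default of "str" -->
  <iterator type="value" attribute="prior_adjust"
            values="-1.0,0.0,1.0" value_type="float"/>
</run_settings>
```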

Here are the attributes and values available for the "increment" iterator:

Attribute
Value
Obligatory?
Description
attribute
a string
yes
The name of the training engine attribute you're iterating over. This attribute would be one that could be set in the <run_settings> element above or in the <run_settings> element in workflow steps in task.xml.
start_val
an integer or float
yes
The initial value for the incrementer.
end_val
an integer or float
yes
The final value for the incrementer.
increment
an integer or float
yes
The increment to add to the value on each iteration.
force_last
"yes"
no
If present, force the last value to be processed, even if it's not exactly an increment. For instance, if you're incrementing recall/precision bias from -2.0 to 2.2 by increments of .5, the last value processed will be 2.0, unless you provide this setting. This setting is also useful with float values even if it appears that the endpoints match the increment precisely, due to the way programming languages like Python deal with floats; asking to increment from .1 to .5 by .1 may or may not give you exactly .5 as the final value, so you might want to use this setting to force whatever value it is (e.g., .500000000001) to be processed.
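
The recall/precision bias case used in the force_last description can be sketched as an "increment" iterator. As before, prior_adjust is an assumed attribute name used only for illustration, and the workflow and step names are placeholders.

```xml
<run_settings>
  <args workflow="Demo" steps="zone,tokenize,tag"/>
  <!-- Step from -2.0 in increments of 0.5; the end value 2.2 is not itself
       an increment, so force_last asks for it to be processed as well -->
  <iterator type="increment" attribute="prior_adjust"
            start_val="-2.0" end_val="2.2" increment="0.5" force_last="yes"/>
</run_settings>
```

Without force_last in this sketch, the last value processed would be 2.0, as described in the table above.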

<run> (of <runs>)

Attributes

Attribute
Value
Obligatory?
Description
name
a string
yes
The name of this experimental run. This is used as the name of the subdirectory in which this run is conducted.
model
a string
no
A comma-separated sequence of names of models to use. Each name must match the "name" value of some <model_set> element in the experiment file. If there are multiple names, the corresponding model sets must all have been trained on the same corpus and partition. If no models are provided, then the score_steps attribute of <score_args> must be specified in order to inform the scorer which steps are being evaluated.

Children

Element
Obligatory?
Repeatable?
Description
<test_corpus>
no
yes
One or more test corpora (and possibly partitions of corpora) to use in this run. This element is almost always required, except when the models are crossvalidation models, in which case it is forbidden.

<test_corpus> (of <run>)

May be repeated. One or more test corpora (and possibly partitions of corpora) to use in this run.

Attributes

Attribute
Value
Obligatory?
Description
corpus
a string
yes
The name of the test corpus to use. This string must match the "name" value of some <corpus> element in the experiment file.
partition
a string
no
If present, the name of a partition in the specified corpus, which must match the "name" attribute of some <partition> element in the corpus. If not present, the entire corpus will be used.
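
To illustrate the two attributes, here is a sketch of a <run> that tests one model against two corpora. All names ("held_out", "my_model", "my_corpus", "other_corpus", "test") are hypothetical and would need to match the <model_set>, <corpus> and <partition> elements defined elsewhere in your experiment file.

```xml
<run name="held_out" model="my_model">
  <!-- Use only the partition named "test" of this corpus -->
  <test_corpus corpus="my_corpus" partition="test"/>
  <!-- No partition attribute: the entire corpus is used -->
  <test_corpus corpus="other_corpus"/>
</run>
```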