If you've received a previous version of MAT, this page contains
instructions on how to upgrade to the new version.
Version 3.3 is almost completely backward compatible with version
3.2. Its primary goal is to future-proof MAT.
In order to make the core of MAT compatible with pip, the Python
library has been split in two. The vast majority of the library
remains in src/MAT/lib/mat/python/MAT, and the remainder,
dedicated specifically to jCarafe bindings and the MAT Web server,
is found in src/MAT/lib/mat/python/MATApplication. Developers will
almost never need this latter library.
MATUpdateTaskXML and MATUpdateWorkspace2To3 have been removed.
They were both for use in migrating from MAT 2.x to MAT 3.x. MAT
3.x is now five years old, and we no longer test these conversion
utilities. If you haven't migrated from MAT 2.x, you'll have to
download MAT 3.1 or MAT 3.2 and use the conversion utilities in
those releases.
Previous to 3.3, there was a variable in the ExecutionContext
module called _SUBPROCESS_DEBUG which was used internally to
control debug levels for subprocess calls. It has been removed in
favor of a lazier way of accessing the value which respects live
changes to the MAT configuration.
Previous to 3.3, in most situations the XML text and attribute
values in the task.xml file were required to be ASCII-compatible;
the only exception was the tokenless_autotag_delimiters attribute.
This restriction was always unnecessary, and the migration to
Python 3 has compelled its removal. All such text and values in
task.xml in MAT 3.3 can take advantage of all of Unicode (with the
caveat that certain nonlinguistic Unicode characters might cause problems).
The new Unicode issues discussion
describes the basic problem, which has lurked within MAT for a
very long time, but will be exacerbated by the differences between
Python 2 and Python 3. If you have annotated documents which
contain characters outside the Unicode Basic Multilingual Plane,
you will likely not be able to use MAT.
Prior to 3.3, any boolean attributes were automatically provided
with keyboard accelerators because there was no way of specifying
them. In 3.3, the <positive> and <negative>
sub-elements of the <boolean> attribute declaration in the
new-style annotation set declarations now provide hooks for
declaring these accelerators explicitly. This feature has not been
backported to the legacy declaration method, so in 3.3, if you
want keyboard accelerators for boolean attributes, you must
upgrade to the new-style declaration format.
If you use the MAT standalone viewer, and you've embedded a JSON
task description in your application, or if you use the MATAnnotationInfoToJSON
utility to generate the JSON task description for this or any
other reason, this representation has changed a tiny bit in
version 3.3; in conjunction with the previously-described change
in the declaration of boolean attribute keyboard accelerators, the
JSON representation of accelerators for all attributes has been
changed. If you refer to attribute accelerators, or have boolean
attributes in your task, you must regenerate the JSON.
Version 3.2 is almost completely backward compatible with version
3.1. There are a number of new features, and a few
non-backward-compatible changes. The most substantial of these is
that we've upgraded the version of jCarafe.
We've updated MAT to the very latest version of jCarafe. You'll
need to rebuild your models.
In MAT 3.0, the internal structure of workspaces was completely
reorganized to fully support workflows more transparently.
However, we didn't fully update the way documents in workspaces
were assigned to users; unassigned documents could illegitimately
be assigned to users after they'd been modified if they had just
advanced to a new workspace step, and previously assigned
documents couldn't be assigned at all, because that portion of the
code hadn't been updated. In MAT 3.2, we've fixed these errors by
extending one of the workspace database tables to keep track of
which imported workspace files are entirely pristine. When you
open a pre-3.2 workspace in MAT 3.2, the workspace database will
be updated automatically, and it will no longer be useable in MAT
versions before 3.2.
The getDocument() method of the viewer API has been modified to
match the behavior of other document panels. This was necessary in
order to support annotation tables in the standalone UI. To
retrieve the "bare" document which is equivalent to the previous
result of getDocument(), use getBareDocument().
In previous versions, a swipe in tokenized text in the UI
expanded to the nearest token boundaries on the left and right,
but did not contract to the nearest boundaries (i.e., if the user
had selected peripheral whitespace). We regard this as a bug, and
it has been fixed.
The MATManagePluginDirs
utility is now more thorough in how it inspects tasks during
validation and installation. As a result, it will now refuse to
install tasks whose Python customizations raise import errors
during installation, even if these errors would not be raised
during normal execution. Previously, installing such a task would
report the errors, but still succeed in installing the task, which
could be confusing to users.
Previous versions of MAT were optionally distributed with tabbed
terminal packages to provide a tabbed view of the MAT Web server,
where the various server logs were displayed separately from the
Web server command loop. These tabbed terminal packages were old,
and we had not tested MAT with them in a very long time. In
version 3.2, we've stopped distributing these packages. Instead,
the --spawn_tabbed_terminal option of MATWeb now accepts an
argument which is a 4-argument command which the user can provide
to run his or her own tabbed terminal of choice. We've provided an
example of such a command for the Unix GNOME windowing package in
web/examples/gnome_tabbed_web_server_terminal.sh.
For some reason, annotation offsets were not often recorded in
the UI log. Although we intend the logs to remain anonymous, this
information doesn't really compromise the anonymity; little can be
gleaned from the location of an annotation gesture beyond the
minimum length of the file being annotated. Furthermore, this
information was already being recorded for annotation gestures
which modify the extent of an annotation. In 3.2, if a gesture
affects an annotation, the offsets are recorded in the UI log.
In addition, the gesture_type column in the UI log has been
changed to gesture_method, and the modify_annotation event has
been changed to modify_attribute. The remove_annotation_failed
action has been changed to remove_annotation_failure, and the
parameters have been changed to align with other _failure actions.
A new select_tab action has been added to the UI log.
Reconciliation logging has been completely reorganized. All
annotation creation and modification gestures are now captured
normally in the log, as well as vote actions.
The buttons in the reconciliation
table have been renamed and reorganized.
The third-party package which supported the
--subprocess_statistics common command-line option only ever
worked on Linux, and was rarely included in the distribution. It
has been removed for simplicity and future maintainability.
Third-party dependencies are no longer distributed unzipped in
the src/ subdirectory; they're now found in the third_party
directory, as zips, and unpacked in third_party/install during the
installation process. The MANIFEST file has also been removed; the
information is now inferred from the distribution during the
installation process. This change should be largely invisible to
the user; the only consequence is that the MAT 3.2 redistribute.py utility can
now be used only with MAT 3.2 or later, and the version of the utility
distributed in previous MAT releases can only be used with
releases previous to MAT 3.2.
In order to support the expanded task configuration features,
we've completely reorganized the task.xml file. We've introduced
the concept of an engine; completely reworked the way workspaces
are configured and built; and localized optional and obligatory
pretagging in the definition of steps. You can read more about the
new task organization here
and here. The MATUpdateTaskXML tool is
designed to do most, if not all, of this work for you (and should
tell you what it can't do as it tries to do it). Your first step
in updating to 3.0 should be to run this tool before installing
your (updated) task.
One of the confounding aspects of the 2.0 implementation of
workflows and steps was that the task required a global "undo"
order, in order to ensure that when steps were undone in a
workflow, all appropriate steps in the task were undone (e.g., if
you undid tag and zone in a workflow which doesn't contain
tokenization, and some workflow in the task contained tokenization
between zone and tag, tokenization was undone as well). This
global undo order was impossible to maintain in 3.0, and as a
result, it has been abandoned. If you undo steps in a workflow,
only those steps will be undone, and as a result, your documents
can end up in unusual states (e.g., tokenized but not zoned). In
order to compensate for this issue, in 3.0, workflows, by default,
are not undoable; the "retreat" buttons in the UI will not be
present, for instance. You can specify workflows as undoable using
the new "undoable" attribute of the <workflow> element in
your task.xml file, but we encourage you to use it sparingly; you
should only enable this feature for workflows which support all
your tagging steps (content and otherwise).
Workspaces now explicitly record the workflow they apply and the
language they're supporting. In addition, workspaces have a new
"review" folder, for human review, and the contents and metadata
associated with reconciliation folders are completely different.
As a result, your 2.0 workspaces must be updated to 3.0 (after
you've updated your task) using the MATUpdateWorkspace2To3
tool.
In 2.0, MAT had a demo configuration capability, which we decided
we couldn't afford to maintain. It's been removed in 3.0.
In 2.0, file mode featured two buttons: a forward arrow to
complete the current step, and a backward arrow to undo the
current step. In 3.0, we've clarified and sorted out what these
operations do, and how they interact with marking gold. MAT now
features four types of annotation steps, which are specified in
your task.xml file:
The available advance/retreat buttons vary depending on which
step you're currently in. A hand icon refers to hand annotation, a
gear icon refers to automated annotation, a right arrow refers to
marking the step gold, and a left arrow indicates retreating;
coupled with a hand icon, the left arrow indicates retreating into
the most recent hand annotation phase, rather than undoing the
current or previous step. You can find more details here.
In 2.0, MAT introduced SEGMENT annotations, an administrative
annotation type which tracks annotation progress. These SEGMENT
annotations referred to the document in general; e.g., a "human
gold" status indicated that the document was marked gold. In 3.0,
these statuses refer to particular annotation sets; so when you
mark a document gold in an annotation step (and you can have
multiple hand-annotatable steps), you're marking gold the sets
associated with the step. In other words, MAT is now tracking
(properly, we think) the annotation status of your annotation
sets. To support this, each SEGMENT annotation has a "set"
attribute, indicating what annotation set it refers to.
The change to SEGMENT annotations is only one way the
administrative information in MAT-JSON documents has changed.
Another way is that MAT-JSON documents now record their overall
progress in terms of the annotation sets that have been added,
rather than the workflow steps that have been applied. When 2.0
MAT-JSON documents are read into a 3.0 tool, all this
administrative information is automatically updated. As a result,
documents saved in 3.0 cannot be used in 2.0.
As part of the various 3.0 changes, the options to MATEngine have
changed slightly:
As part of the change in administrative information, 2.0
administrative information that can't be updated when the document
is read is discarded. If, for instance, you've edited your task to
split up your content annotations into multiple sets (say, span
annotations and relation annotations), the SEGMENT statuses can't
be updated consistently, and "human gold" or "reconciled" statuses
will be discarded. You can use the new --mark_gold and
--mark_reconciled options of MATEngine to fix this.
Because workflows can now have multiple content annotation steps,
the --tagger_local and --tagger_model flags have been replaced.
These flags can now be specified as --<step_name>_local and
--<step_name>_model, where <step_name> is the name of
the step. E.g., if one of your tag steps is named "carafe_tag",
these flags would be --carafe_tag_local and --carafe_tag_model. In
the context of declaring <run_settings> in the context of a
<step> in your <workflow> element in task.xml, these
flags should be referenced as "local" and "model". The rule of
thumb is: in the context of the engines, a prefix is required; in
the context of a step, it is forbidden.
Because you can now use jCarafe
as a trainable tagger in multiple steps in your workflow, (almost)
all the option names associated with the jCarafe tagger are now
prefixed in the same way that "local" and "model" are. E.g., if
you want to change the recall/precision balance for the carafe_tag
step, you must now use --carafe_tag_prior_adjust instead of
--prior_adjust in the context of the workflow or engine, or refer
simply to "prior_adjust" in the context of declaring a workflow
step.
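The prefixing rule can be illustrated with a minimal Python sketch. The helper name engine_flag is hypothetical and not part of MAT; it only shows how the engine-level flag is formed from the step name and the bare option name. Since only "(almost) all" jCarafe options are prefixed, check the jCarafe documentation before relying on it for a particular option.

```python
def engine_flag(step_name, option):
    """Build the engine-level flag for an option that belongs to one step.

    Hypothetical helper: on the command line the option is prefixed with the
    step name, while inside a <step> declaration the bare name is used.
    """
    return "--{}_{}".format(step_name, option)


# The step name "carafe_tag" is the example used in the text above.
for bare_name in ("local", "model", "prior_adjust"):
    print(bare_name, "->", engine_flag("carafe_tag", bare_name))
```

Inside <run_settings> for the same step, the bare names on the left are the ones to use.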
Before MAT 2.0, workspaces had a "prep" operation, an "import"
operation and an "autotag" operation. In 2.0, we removed the
"prep" operation and folded it into the "import" operation. In
3.0, any human-annotatable workflow can serve as the basis for a
workspace, and the goal of the workspace is to advance
automatically to the next human-annotatable point. As a result:
The workspaces now also feature a number of new operations to
manage reconciliation and review, which you can learn about here.
In order to support the various new workflow and workspace
features, the following elements of the experiment XML have changed:
Unless you've used the model_class or <workspace_corpora>
features in your experiment, you should not notice these changes
in moving from 2.0 to 3.0.
When you fill the value of an annotation attribute in the UI, you
have new, more streamlined options in 3.0. First, we've introduced
the idea of an "active" annotation editor, if multiple annotation
editor windows are open; from the annotation popup menu, you can
now add annotations you click on as attribute values in the active
annotation editor without returning to the editor window. Second,
if your attribute is a set or a list, you can add multiple values
without re-enabling the attribute for filling for each value.
In 2.0, the default location for spanless annotations without any
annotation-valued attributes of their own was at the top of the
document. This placement turned out to be problematic for certain
spanless annotations (e.g., those representing implicit argument
fillers). In 3.0, spanless annotations which have no implicit span
information, but are attributes of elements with implicit span
information, are positioned next to the elements which point to
them.
A few of the log entry names have been changed, and a number of
obsolete entries have been removed.
The distinguishing_attribute_for_equality attribute in the task.xml file was used pre-2.0 as an input to scoring, and in 2.0 as an input to reconciliation. In 3.0, it's been completely superseded by the similarity configurations, and has been removed.
Support for Cygwin has been removed, because Python in Cygwin
does not support sqlite, and sqlite is required for the MAT
workspaces in 2.0. Migrate to Windows native.
Because MAT now explicitly defines the annotations and
well-formedness conditions for attributes separately from its
display information, the task.xml file has been reorganized. You
can use the MATUpdateTaskXML
tool to update your task.xml file automatically.
The version of jCarafe which is delivered with MAT 2.0 is
0.9.8.5.b-06, which has a different model structure than the
version delivered with 1.3. You must rebuild all your models,
either using MATModelBuilder
(in file mode) or the "modelbuild" operation of MATWorkspaceEngine (in
workspace mode).
The 1.3 UI used a desktop-in-a-browser metaphor, which raised a
number of issues, including poor use of screen real estate. In
2.0, we've completely reorganized the UI, and changed the URL.
In previous releases, you really didn't have the option to pass
any command-line options to the MATWeb server running under the
tabbed terminal. As the command-line options to MATWeb expanded,
and became more important, this turned out to be a bad idea. As a
result, we've now reorganized the tabbed terminal startup so that
it's part of MATWeb. The mat_controller.sh application is gone.
The Windows mat_controller.bat script is still present, but it
simply invokes MATWeb with the --spawn_tabbed_terminal option.
We have completely reorganized the internal structure of workspaces for 2.0. These new
workspaces are more powerful and impose fewer requirements on the
user. Your MAT 1.3 workspaces cannot be used with MAT 2.0 without
modification. We've provided an upgrade tool which will
allow you to convert your MAT 1.3 workspaces to MAT 2.0.
The new workspaces feature many fewer folders; a SQLite database
which manages the document state information; real transaction and
file locking; document assignment, potentially to multiple
annotators; extensive logging capabilities; and infrastructure for
future capabilities like reconciliation and complex reconciliation
workflows, prioritization queues, and segment-by-segment
annotation.
As a result of this change, it's no longer possible to run an
experiment against a workspace by pointing to, e.g., the
"completed" folder. So as part of this change, there's now special
support for running experiments against workspaces, both from MATWorkspaceEngine and MATExperimentEngine.
MATScore and MATExperimentEngine have
long supported writing one of three CSV file formats (Excel
formulas, OpenOffice formulas, and no formulas). In 2.0, you can
now write multiple formats in the same run, and the name of each
CSV file clearly indicates the formula type. As a result, the
--no_csv_formulas and --oo_separator command-line options have
been removed, and replaced with --csv_formula_output.
Because the scorer now provides mismatch details for all
conditions, this flag has been renamed to
--tag_output_mismatch_details.
Due to enhancements to the scorer, some of the columns in the
output spreadsheets have been renamed or moved, and others have a
slightly different interpretation. Full details here.
In previous releases, we deprecated, but retained, the "operate"
operation in MATWorkspaceEngine. This operation has finally been
removed in 2.0. If you had still been doing something like this:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory operate core modelbuild
you should now do this:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory modelbuild core
See the workspace
documentation for more details.
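If you have wrapper scripts that still build the old argument order, a small helper along the following lines may help. This is only a sketch: the function name is hypothetical, and it assumes that every "operate" call has the shape shown in the example above (workspace, "operate", folder, operation), which this guide does not confirm for all operations.

```python
def rewrite_operate_call(args):
    """Rewrite [workspace, 'operate', folder, operation, ...] to the 2.0 order.

    Assumption: the 2.0 form is [workspace, operation, folder, ...], as in
    the single modelbuild example in this guide.
    """
    if len(args) >= 4 and args[1] == "operate":
        workspace, _, folder, operation = args[:4]
        return [workspace, operation, folder] + list(args[4:])
    # Anything else is returned unchanged.
    return list(args)


print(rewrite_operate_call(["mydirectory", "operate", "core", "modelbuild"]))
```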
In version 1.2, the text_right_to_left attribute lived on the
workflow element in task.xml; we anticipated that different
workflows might be used for different languages within the same
task. Since then, we've realized that the task is going to be the
appropriate level of encapsulation for language differences for
the foreseeable future. Furthermore, the current implementation of
right-to-left encoding did not work appropriately with workspaces.
Accordingly, we've moved this attribute to the web_customization
element, and it is now global to tasks.
The experiment engine has now been extended with general-purpose
iterators for sets of values and for value increments. So it's now
possible, for instance, to vary the number of model iterations
from 20 to 100 by increments of 10 without having to write a
separate model set specification for each possible value. These
iterators can be combined, in which case you'll get the
cross-product of the possible value settings, or you can define
your own iterators to get more sophisticated behavior (e.g.,
iterating over pairs of attribute-value sets). For the user, this
means that a couple of attributes have been removed from the
experiment engine, and a new set of elements and attributes has
been added.
In version 1.2, all you could iterate on was corpus size. The
mechanism for this iteration has now changed. In version 1.2, this
is what you'd do:
[...]
<model_sets dir="model_sets">
<build_settings training_increment="4"
truncate_to_increment="yes"/>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
In version 1.3, it looks like this instead:
[...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
You can see that the size processing has been removed from the
<build_settings> and added to a new <corpus_settings>
element, which contains an instance of the new <iterator>
element to specify the type of the iteration. See the documentation and examples for the
experiment engine for more details. Note that in version 1.2, you
had to specify explicitly that the iteration ends on an increment
exactly; in 1.3 this is the default, and to force the final corpus
size to be used, you'll need the force_last attribute:
[...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4" force_last="yes"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
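The rewrite above can be sketched in Python with the standard xml.etree.ElementTree module. The function name is hypothetical, and the handling of truncate_to_increment is an assumption drawn from the description above: a 1.2 file without truncate_to_increment="yes" is taken to want the final corpus size, so it receives force_last="yes".

```python
import xml.etree.ElementTree as ET


def migrate_model_sets(model_sets):
    """Move 1.2 size-iteration attributes into a 1.3 <corpus_settings> block."""
    build = model_sets.find("build_settings")
    if build is None or "training_increment" not in build.attrib:
        return model_sets
    increment = build.attrib.pop("training_increment")
    truncate = build.attrib.pop("truncate_to_increment", "no")
    iterator = ET.Element("iterator", {"type": "corpus_size", "increment": increment})
    if truncate != "yes":
        # Assumption: without truncation, 1.2 ended on the full corpus size.
        iterator.set("force_last", "yes")
    corpus_settings = ET.Element("corpus_settings")
    corpus_settings.append(iterator)
    model_sets.insert(list(model_sets).index(build), corpus_settings)
    if not build.attrib and len(build) == 0:
        # Nothing else was configured there, so drop the empty element.
        model_sets.remove(build)
    return model_sets


OLD_MODEL_SETS = """<model_sets dir="model_sets">
<build_settings training_increment="4" truncate_to_increment="yes"/>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>"""

migrated = migrate_model_sets(ET.fromstring(OLD_MODEL_SETS))
print(ET.tostring(migrated, encoding="unicode"))
```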
The experiment engine output spreadsheets have been slightly expanded to include information about the run and model "families" in addition to the actual run and model. This change follows from the introduction of general iterators described above. See the documentation on MATExperimentEngine for details.
In order to support the iterators in the experiment engine, we've
reorganized the structure of the experiment directory somewhat.
See the documentation on MATExperimentEngine
for details.
It is now possible to run MAT in Windows without Cygwin
installed.
Unlike previous versions, there is a single distribution bundle
for MAT 1.2 for all supported platforms. For compatibility with
Windows, this bundle is now a zip file.
If you use mat_controller.sh or mat_controller.bat under Windows,
you'll find that there's a new tabbed terminal tool we're using,
which has the advantage of not requiring Cygwin.
If you're using mat_controller.sh under MacOS X, and you intend
to install 10.6, note that the previous version of Terminator.app,
which supports the tabbed terminal behavior in mat_controller.sh,
will not work in 10.6; you must install the newer version provided
with MAT 1.2.
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations.
There are a number of important changes that are required as a
result. Among other things, the Java tokenizer produces slightly
different token boundaries than the original OCaml tokenizer. This
is problematic because the entire basis of most annotation
systems, including MAT, is the subdivision into words (tokens). In
order to have optimal performance, the tokenization of documents
which are to be automatically tagged should match the tokenization
of the documents which were used to create the tagger model. This
means that in order to migrate from version 1.1 to version 1.2,
among other things, you must retokenize your documents and update
any references to the OCaml tokenizer.
First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back
up your data before you run this utility.
Next, if you refer to a tokenization step implementation in your
task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTokenizationStep to
MAT.JavaCarafe.CarafeTokenizationStep. You may also need to
specify the heap_size attribute on the relevant tokenization
<step> in any workflow, if it turns out that the
default Java heap size isn't large enough for your purposes (this
attribute can also be specified on the command line; see the Carafe engine documentation).
In version 1.2, the original OCaml tokenizer and Carafe
trainer/tagger have been replaced by the Java reimplementations.
There are a number of important changes that are required as a
result. Among other things, the model format for the Java engine
is completely different from the model format for the original
OCaml trainer/tagger. This means that you must rebuild all your models,
and update any references to the OCaml trainer/tagger.
First, retokenize your documents using MATRetokenize, as
described above, and update your tokenization steps.
Next, update your tagger and trainer settings in task.xml
according to the documentation provided for the Carafe engine.
Next, if you refer to a tagging step in your task.xml file, you must change all
occurrences of MAT.PluginMgr.CarafeTagStep to
MAT.JavaCarafe.CarafeTagStep. You may also need to specify the
heap_size attribute on the relevant tag <step> in any
workflow, if it turns out that the default Java heap size
isn't large enough for your purposes (this attribute can also be
specified on the command line; see the Carafe engine documentation).
Similarly, if you have a <model_build_settings> entry, you
must change all occurrences of
MAT.CarafeModelBuilder.CarafeModelBuilder to
MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the
heap_size attribute as well. (Note below that you must also change
the syntax of <model_build_settings>.)
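The three class renames described above are plain string substitutions, so they can be applied to the text of a task.xml file in one pass. The following is a minimal sketch; the function name is hypothetical, and it does not add heap_size or change the syntax of <model_build_settings>, both of which still need to be handled by hand.

```python
# Old OCaml-era class names mapped to their Java replacements, as listed above.
RENAMED_CLASSES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def update_class_references(task_xml_text):
    """Return the task.xml text with every old class name replaced."""
    for old_name, new_name in RENAMED_CLASSES.items():
        task_xml_text = task_xml_text.replace(old_name, new_name)
    return task_xml_text


sample = '<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"/>'
print(update_class_references(sample))
```

Back up task.xml before overwriting it with the result.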
Note that for the tagger, the prior_adjst attribute has been
renamed to prior_adjust. For the trainer, the engine attribute has
been eliminated, and the feature_set attribute as well; there's
now a new feature_spec attribute which refers to a file in which
you can describe your feature set, if you don't want to use the
default feature set. Also, the psa_iterations flag has been
removed, due to more numerous options in the Carafe trainer;
psa_iterations="6"
becomes
training_method="psa" max_iterations="6"
Because PSA no longer requires random segments, the
no_random_psa_segments flag has been removed.
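The attribute changes above can be summarized as a small Python sketch operating on plain dictionaries of attribute names and values. The function names are hypothetical. Note that feature_set is simply dropped here: if you relied on a non-default feature set, you will need to describe it in a file and point feature_spec at it yourself.

```python
def migrate_trainer_settings(attrs):
    """Return a copy of 1.1 trainer attributes rewritten for 1.2."""
    new = dict(attrs)
    # engine and feature_set were eliminated; no_random_psa_segments was removed.
    for removed in ("engine", "feature_set", "no_random_psa_segments"):
        new.pop(removed, None)
    if "psa_iterations" in new:
        # psa_iterations="N" becomes training_method="psa" max_iterations="N".
        new["training_method"] = "psa"
        new["max_iterations"] = new.pop("psa_iterations")
    return new


def migrate_tagger_settings(attrs):
    """Return a copy of 1.1 tagger attributes with prior_adjst renamed."""
    new = dict(attrs)
    if "prior_adjst" in new:
        new["prior_adjust"] = new.pop("prior_adjst")
    return new


print(migrate_trainer_settings({"psa_iterations": "6"}))
```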
Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.
In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.
In order to support a more flexible way of specifying partitions
in experiments, we've changed the way partitions are specified in
the experiment XML files. We compare the relevant files below:
Version 1.1:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test" corpus="test"/>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test" corpus="test"/>
</runs>
</experiment>
Version 1.2:
<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>
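The difference between the two files can also be expressed as a Python sketch using xml.etree.ElementTree. It is based only on the example pair above and rests on several assumptions: that split_fraction is the test fraction, that the new partitions are named "train" and "test", and that a corpus attribute on <model_set> or <run> maps to the train or test partition respectively. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

OLD_EXPERIMENT = """<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test"><pattern>*.json</pattern></corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test" corpus="test"/>
</model_sets>
<runs dir="runs">
<run name="test" model="test" corpus="test"/>
</runs>
</experiment>"""


def migrate_experiment(root):
    """Rewrite a 1.1-style experiment tree into the 1.2 partition syntax."""
    corpora = root.find("corpora")
    old = corpora.find("partition[@ctype='split']")
    if old is not None:
        # Assumption: split_fraction is the share held out for testing.
        test_fraction = Decimal(old.get("split_fraction"))
        position = list(corpora).index(old)
        corpora.remove(old)
        train = ET.Element("partition", {"name": "train", "fraction": str(1 - test_fraction)})
        test = ET.Element("partition", {"name": "test", "fraction": str(test_fraction)})
        corpora.insert(position, train)
        corpora.insert(position + 1, test)
    for model_set in list(root.iter("model_set")):
        corpus = model_set.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(model_set, "training_corpus", {"corpus": corpus, "partition": "train"})
    for run in list(root.iter("run")):
        corpus = run.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(run, "test_corpus", {"corpus": corpus, "partition": "test"})
    return root


print(ET.tostring(migrate_experiment(ET.fromstring(OLD_EXPERIMENT)), encoding="unicode"))
```

The fractions are written as "0.8" and "0.2" rather than ".8" and ".2"; Decimal is used so that the subtraction does not introduce floating-point noise.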
Note the following changes:
In order to clarify how task settings are handled in MAT, a
number of changes have been made to the task.xml
file syntax.
First, the <step> element of <step_implementations>
no longer accepts arbitrary attributes. If you made use of this
feature to pass settings to the initialization methods of workflow
steps, you must now use the <create_settings> child element.
We doubt that anyone has made use of this feature.
Second, the <step> element of <workflow> no longer
accepts arbitrary attributes. If you make use of this feature to
pass settings to workflow steps, you must now use the
<create_settings>, <ui_settings>, or
<run_settings> child elements. The most likely situation
where this might arise is in passing defaults to the run methods
of steps. For instance, if you used this feature to increase the
Java heap size for Java Carafe, your task.xml file would have to
be revised as follows.
Version 1.1:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
...
</workflows>
...
Version 1.2:
...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag">
<run_settings heap_size="2G"/>
</step>
</workflow>
...
</workflows>
...
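For tasks with many workflows, the same revision can be sketched in Python with xml.etree.ElementTree. The function name is hypothetical, and the sketch only moves attributes you name explicitly (heap_size by default), because deciding whether a given attribute belongs in <create_settings>, <ui_settings>, or <run_settings> is a judgment this guide leaves to you.

```python
import xml.etree.ElementTree as ET


def move_run_settings(workflows, run_setting_names=("heap_size",)):
    """Move the named attributes of each <step> into a <run_settings> child."""
    for step in list(workflows.iter("step")):
        moved = {}
        for name in run_setting_names:
            if name in step.attrib:
                moved[name] = step.attrib.pop(name)
        if moved:
            run_settings = step.find("run_settings")
            if run_settings is None:
                run_settings = ET.SubElement(step, "run_settings")
            run_settings.attrib.update(moved)
    return workflows


OLD_WORKFLOWS = """<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
</workflows>"""

print(ET.tostring(move_run_settings(ET.fromstring(OLD_WORKFLOWS)), encoding="unicode"))
```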
Third, the way settings are specified for model configurations
has changed. The name and class for the configuration are now
separated from the settings which are passed to the model builder,
as follows.
Version 1.1:
...
<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
training_method="psa" max_iterations="6">
</model_build_settings>
...
Version 1.2:
...
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
...
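A minimal Python sketch of this restructuring follows. The function name is hypothetical, and it assumes that the configuration name, when present, is carried in an attribute called "name"; every attribute other than "name" and "class" is treated as a build setting.

```python
import xml.etree.ElementTree as ET


def migrate_model_build_settings(old):
    """Build a 1.2 <model_config> element from a 1.1 <model_build_settings>."""
    identity = ("name", "class")
    config_attrs = {k: v for k, v in old.attrib.items() if k in identity}
    build_attrs = {k: v for k, v in old.attrib.items() if k not in identity}
    config = ET.Element("model_config", config_attrs)
    ET.SubElement(config, "build_settings", build_attrs)
    return config


old_settings = ET.fromstring(
    '<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder" '
    'training_method="psa" max_iterations="6"/>'
)
print(ET.tostring(migrate_model_build_settings(old_settings), encoding="unicode"))
```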
Finally, the <workflow> element no longer accepts arbitrary
settings; these settings must be passed using the
<ui_settings> child element. No task appears to use this
option yet, so this shouldn't affect anyone.
In order to support a more flexible way of invoking the MAT
engine in experiments, the way the configuration of experiments is
cached has changed in version 1.1. What this means is that you
will not be able to invoke MATExperimentEngine on experiment
directories created using version 1.0 to regenerate the experiment
scores.
In order to support a more flexible way of invoking the MAT
engine in experiments, we've changed the way corpus preprocessing
and test run processing are specified. In version 1.0, the MAT
engine was called as a command-line tool, and the options were
specified as a command line; in version 1.1, the options are
specified as XML attribute-value pairs. We compare the relevant
experiment XML blocks below:
Version 1.0:
<corpora dir="corpora">
<prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args>--steps zone,tokenize,tag --workflow Demo</args>
</run_settings>
[...]
</runs>
Version 1.1:
<corpora dir="corpora">
<prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
[...]
</corpora>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
[...]
</runs>
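If you have many experiment files, the option strings can be converted mechanically. The sketch below uses Python's standard shlex module to split the 1.0 command line the way a shell would and returns the attribute-value pairs for the 1.1 element. The function name is hypothetical, and the sketch deliberately rejects flags that take no value, since this guide does not say how those are written in 1.1.

```python
import shlex


def options_to_attributes(command_line):
    """Turn '--name value' pairs from a 1.0 option string into a dict."""
    tokens = shlex.split(command_line)
    if len(tokens) % 2 != 0:
        raise ValueError("expected '--name value' pairs: %r" % command_line)
    attributes = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--") or value.startswith("--"):
            raise ValueError("expected '--name value' pairs: %r" % command_line)
        attributes[flag[2:]] = value
    return attributes


print(options_to_attributes("--steps zone,tokenize,tag --workflow Demo"))
```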
Version 1.1 adds the ability to define different training
engines. Because of this change, if you've defined your own task
and you specified model build settings in your task.xml file, you
must add a class attribute to the model_build_settings element.
This attribute is not optional, and there is no default. If you're
using the default Carafe engine, the value you should use for this
attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the
following example:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
Version 1.1 adds the ability to import MAT JSON documents into
your workspaces which haven't yet been processed (as well as other
annotation formats, like XML inline). Because of this change, if
you have a workspace, you must add a directory to it. This
directory is expected by the MAT workspace engine. For each
workspace directory, do this:
% mkdir <workspace_dir>/folders/rich_incoming
In version 1.1, it's possible to have multiple model build
configurations in your task.xml file. In order to ensure that the
correct configuration adds the appropriate command line options to
the MATModelBuilder executable, it was necessary to introduce a
new restriction on the --task option for MATModelBuilder: if
it appears, it must now be the first command-line option. In other
words, the following will now raise an error:
% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
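Scripts that assemble MATModelBuilder command lines can be adapted by moving the --task option to the front. The following Python sketch shows one way to do that on an argument list; the function name is hypothetical, and it assumes the option is always written as two separate arguments ("--task" followed by its value).

```python
def task_option_first(arguments):
    """Return the argument list with '--task <value>' moved to the front."""
    args = list(arguments)
    if "--task" not in args:
        return args
    index = args.index("--task")
    if index + 1 >= len(args):
        raise ValueError("--task requires a value")
    task_pair = args[index:index + 2]
    return task_pair + args[:index] + args[index + 2:]


print(task_option_first([
    "--input_dir", "/path/to/my/other/docs",
    "--task", "Named Entity",
    "--save_as_default_model",
]))
```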
In version 1.0, the default model was defined within the model
build settings. In version 1.1, because of the presence of
multiple model build configurations, we've separated the
specification of the default model in task.xml.
Version 1.0:
<model_build_settings engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6" default_model="default_model"/>
Version 1.1:
<model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder"
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
<default_model>default_model</default_model>
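As a rough Python sketch of this change, the function below pulls default_model out of each <model_build_settings> element, adds the class attribute required in 1.1 if it is missing, and inserts a <default_model> sibling. The function name is hypothetical, as is the <task> wrapper used in the sample, which stands in for whatever element contains the settings in your task.xml. The class value is the one this guide gives for the default Carafe engine.

```python
import xml.etree.ElementTree as ET

DEFAULT_BUILDER_CLASS = "MAT.CarafeModelBuilder.CarafeModelBuilder"


def separate_default_model(parent):
    """Split default_model out of each <model_build_settings> child of parent."""
    for settings in parent.findall("model_build_settings"):
        # 1.1 requires a class attribute; assume the default Carafe engine.
        settings.attrib.setdefault("class", DEFAULT_BUILDER_CLASS)
        default = settings.attrib.pop("default_model", None)
        if default is not None:
            element = ET.Element("default_model")
            element.text = default
            parent.insert(list(parent).index(settings) + 1, element)
    return parent


old_task = ET.fromstring(
    '<task><model_build_settings engine="anonTrain.native" feature_set="ANON-1" '
    'psa_iterations="6" default_model="default_model"/></task>'
)
print(ET.tostring(separate_default_model(old_task), encoding="unicode"))
```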