Upgrade notes

If you've received a previous version of MAT, this page contains instructions on how to upgrade to the new version.

Upgrading from version 3.2 to version 3.3

Version 3.3 is almost completely backward compatible with version 3.2. Its primary goal is to future-proof MAT.

New features

The Python package structure has changed

In order to make the core of MAT compatible with pip, the Python library has been split in two. The vast majority of the library remains in src/MAT/lib/mat/python/MAT, and the remainder, dedicated specifically to jCarafe bindings and the MAT Web server, is found in src/MAT/lib/mat/python/MATApplication. Developers will almost never need this latter library.

Old utilities have been removed

MATUpdateTaskXML and MATUpdateWorkspace2To3 have been removed. They were both for use in migrating from MAT 2.x to MAT 3.x. MAT 3.x is now five years old, and we no longer test these conversion utilities. If you haven't migrated from MAT 2.x, you'll have to download MAT 3.1 or MAT 3.2 and use the conversion utilities in those releases.

Internal subprocess debug management has changed

Previous to 3.3, there was a variable in the ExecutionContext module called _SUBPROCESS_DEBUG which was used internally to control debug levels for subprocess calls. It has been removed in favor of a lazier way of accessing the value which respects live changes to the MAT configuration.

Attributes in task.xml are now Unicode

Previous to 3.3, in most situations the XML text and attribute values in the task.xml file were required to be ASCII-compatible; the only exception was the tokenless_autotag_delimiters attribute. This restriction was always unnecessary, and the migration to Python 3 has compelled its removal. All such text and values in task.xml in MAT 3.3 can take advantage of all of Unicode (with the caveat that certain nonlinguistic Unicode characters might cause problems).

Python 3 counts offsets differently than Python 2 for Unicode characters outside the Basic Multilingual Plane

The new Unicode issues discussion describes the basic problem, which has lurked within MAT for a very long time, but will be exacerbated by the differences between Python 2 and Python 3. If you have annotated documents which contain characters outside the Unicode Basic Multilingual Plane, you will likely not be able to use MAT.
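The counting difference is easy to reproduce in Python 3. The following is a minimal sketch, not MAT code: the helper name utf16_length is hypothetical, and it assumes that the older offsets were counted in UTF-16 code units, as narrow builds of Python 2 (and JavaScript) count them.

```python
def utf16_length(text):
    """Count UTF-16 code units, as a narrow Python 2 build or JavaScript would."""
    return len(text.encode("utf-16-le")) // 2


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
sample = "a\U0001D11Eb"

code_points = len(sample)          # Python 3 counts code points
code_units = utf16_length(sample)  # the non-BMP character takes two code units

# Offset of "b" under each counting scheme.
offset_py3 = sample.index("b")
offset_utf16 = utf16_length(sample[:offset_py3])
```

Every annotation offset that follows a character outside the Basic Multilingual Plane shifts by one per such character, which is why documents containing these characters are affected.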

Boolean attribute keyboard accelerators are no longer automatically declared

Prior to 3.3, any boolean attributes were automatically provided with keyboard accelerators because there was no way of specifying them. In 3.3, the <positive> and <negative> sub-elements of the <boolean> attribute declaration in the new-style annotation set declarations now provide hooks for declaring these accelerators explicitly. This feature has not been backported to the legacy declaration method, so in 3.3, if you want keyboard accelerators for boolean attributes, you must upgrade to the new-style declaration format.

The JSON representation of tasks has changed slightly

If you use the MAT standalone viewer with an embedded JSON task description, or if you generate the JSON task description with the MATAnnotationInfoToJSON utility for any other reason, note that this representation has changed slightly in version 3.3. In conjunction with the change to boolean attribute keyboard accelerators described above, the JSON representation of accelerators for all attributes has changed. If you refer to attribute accelerators, or have boolean attributes in your task, you must regenerate the JSON.

Upgrading from version 3.1 to version 3.2

Version 3.2 is almost completely backward compatible with version 3.1. There are a number of new features, and a few non-backward-compatible changes. The most substantial of these is that we've upgraded the version of jCarafe.

New features

jCarafe version has changed

We've updated MAT to the very latest version of jCarafe. You'll need to rebuild your models.

The workspace database has changed (a bit)

In MAT 3.0, the internal structure of workspaces was completely reorganized to fully support workflows more transparently. However, we didn't fully update the way documents in workspaces were assigned to users; unassigned documents could illegitimately be assigned to users after they'd been modified if they had just advanced to a new workspace step, and previously assigned documents couldn't be assigned at all, because that portion of the code hadn't been updated. In MAT 3.2, we've fixed these errors by extending one of the workspace database tables to keep track of which imported workspace files are entirely pristine. When you open a pre-3.2 workspace in MAT 3.2, the workspace database will be updated automatically, and it will no longer be usable in MAT versions before 3.2.

Standalone viewer API has changed

The getDocument() method of the viewer API has been modified to match the behavior of other document panels. This was necessary in order to support annotation tables in the standalone UI. To retrieve the "bare" document which is equivalent to the previous result of getDocument(), use getBareDocument().

Swiping in tokenized text has improved

In previous versions, a swipe in tokenized text in the UI expanded to the nearest token boundaries on the left and right, but did not contract to the nearest boundaries (i.e., if the user had selected peripheral whitespace). We regard this as a bug, and it has been fixed.
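The expand-and-contract behavior can be illustrated with a short sketch. This is not the MAT UI code; the function name snap_to_tokens and the representation of tokens as (start, end) offset pairs are assumptions made for illustration.

```python
def snap_to_tokens(tokens, start, end):
    """Snap a selection to the boundaries of the tokens it overlaps.

    tokens is a sorted list of (start, end) character offsets.
    Returns None if the selection overlaps no token at all.
    """
    touched = [(s, e) for (s, e) in tokens if s < end and e > start]
    if not touched:
        return None
    return touched[0][0], touched[-1][1]


# "the quick fox": tokens at offsets 0-3, 4-9 and 10-13.
tokens = [(0, 3), (4, 9), (10, 13)]

expanded = snap_to_tokens(tokens, 5, 8)     # partial swipe inside "quick"
contracted = snap_to_tokens(tokens, 3, 10)  # " quick " with peripheral whitespace
```

Both swipes resolve to the span of "quick": the first by expanding outward, the second by contracting past the whitespace.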

Task installation is more conservative

The MATManagePluginDirs utility is now more thorough in how it inspects tasks during validation and installation. As a result, it will now refuse to install tasks whose Python customizations raise import errors during installation, even if these errors would not be raised during normal execution. Previously, installing such a task would report the errors, but still succeed in installing the task, which could be confusing to users.

Tabbed terminal spawning in Web server has changed

Previous versions of MAT were optionally distributed with tabbed terminal packages to provide a tabbed view of the MAT Web server, where the various server logs were displayed separately from the Web server command loop. These tabbed terminal packages were old, and we had not tested MAT with them in a very long time. In version 3.2, we've stopped distributing these packages. Instead, the --spawn_tabbed_terminal option of MATWeb now accepts an argument which is a 4-argument command which the user can provide to run his or her own tabbed terminal of choice. We've provided an example of such a command for the Unix GNOME windowing package in web/examples/gnome_tabbed_web_server_terminal.sh.

Slightly improved, slightly changed UI logging

For some reason, annotation offsets were not often recorded in the UI log. Although we intend the logs to remain anonymous, this information doesn't really compromise the anonymity; little can be gleaned from the location of an annotation gesture beyond the minimum length of the file being annotated. Furthermore, this information was already being recorded for annotation gestures which modify the extent of an annotation. In 3.2, if a gesture affects an annotation, the offsets are recorded in the UI log.

In addition, the gesture_type column in the UI log has been changed to gesture_method, and the modify_annotation event has been changed to modify_attribute. The remove_annotation_failed action has been changed to remove_annotation_failure, and the parameters have been changed to align with other _failure actions.

A new select_tab action has been added to the UI log.

Reconciliation logging has been completely reorganized. All annotation creation and modification gestures are now captured normally in the log, as well as vote actions.

Reconciliation interface has changed slightly

The buttons in the reconciliation table have been renamed and reorganized.

--subprocess_statistics common option has been removed

The third-party package which supported the --subprocess_statistics common command-line option only ever worked on Linux, and was rarely included in the distribution. It has been removed for simplicity and future maintainability.

Distribution layout has changed slightly

Third-party dependencies are no longer distributed unzipped in the src/ subdirectory; they're now found in the third_party directory, as zips, and unpacked in third_party/install during the installation process. The MANIFEST file has also been removed; the information is now inferred from the distribution during the installation process. This change should be largely invisible to the user; the only consequence is that the MAT 3.2 redistribute.py utility can now only be used with MAT 3.2 or later, and the version of the utility distributed in previous MAT releases can only be used with releases previous to MAT 3.2.

Upgrading from version 3.0 to version 3.1

Version 3.1 is completely backward compatible with version 3.0. Plus, there are a number of new features you can enjoy.

New features

Upgrading from version 2.0 to version 3.0

New features

Completely reorganized task.xml

In order to support the expanded task configuration features, we've completely reorganized the task.xml file. We've introduced the concept of an engine; completely reworked the way workspaces are configured and built; and localized optional and obligatory pretagging in the definition of steps. You can read more about the new task organization here and here. The MATUpdateTaskXML tool is designed to do most, if not all, of this work for you (and should tell you what it can't do as it tries to do it). Your first step in updating to 3.0 should be to run this tool before installing your (updated) task.

Workflows are now not undoable by default

One of the confounding aspects of the 2.0 implementation of workflows and steps was that the task required a global "undo" order, in order to ensure that when steps were undone in a workflow, all appropriate steps in the task were undone (e.g., if you undid tag and zone in a workflow which doesn't contain tokenization, and some workflow in the task contained tokenization between zone and tag, tokenization was undone as well). This global undo order was impossible to maintain in 3.0, and as a result, it has been abandoned. If you undo steps in a workflow, only those steps will be undone, and as a result, your documents can end up in unusual states (e.g., tokenized but not zoned). In order to compensate for this issue, in 3.0, workflows, by default, are not undoable; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).

Workspace database and structure changes

Workspaces now explicitly record the workflow they apply and the language they're supporting. In addition, workspaces have a new "review" folder, for human review, and the contents and metadata associated with reconciliation folders are completely different. As a result, your 2.0 workspaces must be updated to 3.0 (after you've updated your task) using the MATUpdateWorkspace2To3 tool.

Demo has been removed

In 2.0, MAT had a demo configuration capability, which we decided we couldn't afford to maintain. It's been removed in 3.0.

UI file mode advancement buttons have changed, and the "Mark gold" is gone

In 2.0, file mode featured two buttons: a forward arrow to complete the current step, and a backward arrow to undo the current step. In 3.0, we've clarified and sorted out what these operations do, and how they interact with marking gold. MAT now features four types of annotation steps, which are specified in your task.xml file:

The available advance/retreat buttons vary depending on which step you're currently in. A hand icon refers to hand annotation, a gear icon refers to automated annotation, a right arrow refers to marking the step gold, and a left arrow indicates retreating; coupled with a hand icon, the left arrow indicates retreating into the most recent hand annotation phase, rather than undoing the current or previous step. You can find more details here.

The meaning of SEGMENT status has changed, and SEGMENT annotations have changed

In 2.0, MAT introduced SEGMENT annotations, an administrative annotation type which tracks annotation progress. These SEGMENT annotations referred to the document in general; e.g., a "human gold" status indicated that the document was marked gold. In 3.0, these statuses refer to particular annotation sets; so when you mark a document gold in an annotation step (and you can have multiple hand-annotatable steps), you're marking gold the sets associated with the step. In other words, MAT is now tracking (properly, we think) the annotation status of your annotation sets. To support this, each SEGMENT annotation has a "set" attribute, indicating what annotation set it refers to.

Administrative information in MAT documents has changed

The change to SEGMENT annotations is only one way the administrative information in MAT-JSON documents has changed. Another way is that MAT-JSON documents now record their overall progress in terms of the annotation sets that have been added, rather than the workflow steps that have been applied. When 2.0 MAT-JSON documents are read into a 3.0 tool, all this administrative information is automatically updated. As a result, documents saved in 3.0 cannot be used in 2.0.

MATEngine options have changed

As part of the various 3.0 changes, the options to MATEngine have changed slightly:

As part of the change in administrative information, 2.0 administrative information that can't be updated when the document is read is discarded. If, for instance, you've edited your task to split up your content annotations into multiple sets (say, span annotations and relation annotations), the SEGMENT statuses can't be updated consistently, and "human gold" or "reconciled" statuses will be discarded. You can use the new --mark_gold and --mark_reconciled options of MATEngine to fix this.

--tagger_local and --tagger_model have changed

Because workflows can now have multiple content annotation steps, the --tagger_local and --tagger_model flags have been replaced. These flags can now be specified as --<step_name>_local and --<step_name>_model, where <step_name> is the name of the step. E.g., if one of your tag steps is named "carafe_tag", these flags would be --carafe_tag_local and --carafe_tag_model. When you declare <run_settings> for a <step> in your <workflow> element in task.xml, these flags should be referenced as "local" and "model". The rule of thumb is: in the context of the engines, a prefix is required; in the context of a step, it is forbidden.

All jCarafe option names are now prefixed

Because you can now use jCarafe as a trainable tagger in multiple steps in your workflow, (almost) all the option names associated with the jCarafe tagger are now prefixed in the same way that "local" and "model" are. E.g., if you want to change the recall/precision balance for the carafe_tag step, you must now use --carafe_tag_prior_adjust instead of --prior_adjust in the context of the workflow or engine, or refer simply to "prior_adjust" in the context of declaring a workflow step.
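The prefixing rule for engine-level flags can be sketched as a small helper. The function names below are hypothetical and are not part of MAT. Because only "(almost) all" jCarafe options are prefixed, a helper like this should be applied only to options the engine documentation lists as prefixed.

```python
def prefixed_flag(step_name, option):
    """Build the engine-level flag for a step option, e.g. --carafe_tag_model."""
    return "--{}_{}".format(step_name, option)


def upgrade_flag(old_flag, step_name):
    """Map a 2.0 flag to its 3.0 per-step form.

    --tagger_local and --tagger_model lose their "tagger" prefix; any other
    flag is assumed to be a jCarafe option that keeps its name.
    """
    legacy = {"--tagger_local": "local", "--tagger_model": "model"}
    option = legacy.get(old_flag, old_flag.lstrip("-"))
    return prefixed_flag(step_name, option)


new_model_flag = upgrade_flag("--tagger_model", "carafe_tag")
new_prior_flag = upgrade_flag("--prior_adjust", "carafe_tag")
```

Within a step declaration in task.xml no such helper is needed, since the bare names ("local", "model", "prior_adjust") are used there.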

Workspace operations have changed

Before MAT 2.0, workspaces had a "prep" operation, an "import" operation and an "autotag" operation. In 2.0, we removed the "prep" operation and folded it into the "import" operation. In 3.0, any human-annotatable workflow can serve as the basis for a workspace, and the goal of the workspace is to advance automatically to the next human-annotatable point. As a result:

The workspaces now also feature a number of new operations to manage reconciliation and review, which you can learn about here.

Experiment XML and MATExperimentEngine have changed

In order to support the various new workflow and workspace features, the following elements of the experiment XML have changed:

Unless you've used the model_class or <workspace_corpora> features in your experiment, you should not notice these changes in moving from 2.0 to 3.0.

Annotation attribute filling in the UI has been improved

When you fill the value of an annotation attribute in the UI, you have new, more streamlined options in 3.0. First, we've introduced the idea of an "active" annotation editor, if multiple annotation editor windows are open; from the annotation popup menu, you can now add annotations you click on as attribute values in the active annotation editor without returning to the editor window. Second, if your attribute is a set or a list, you can add multiple values without re-enabling the attribute for filling for each value.

Placement of spanless sidebar icons has improved

In 2.0, the default location for spanless annotations without any annotation-valued attributes of their own was at the top of the document. This placement turned out to be problematic for certain spanless annotations (e.g., those representing implicit argument fillers). In 3.0, spanless annotations which have no implicit span information, but are attributes of elements with implicit span information, are positioned next to the elements which point to them.

UI logging output has been changed

A few of the log entry names have been changed, and a number of obsolete entries have been removed.

distinguishing_attribute_for_equality has been removed

The distinguishing_attribute_for_equality attribute in the task.xml file was used pre-2.0 as an input to scoring, and in 2.0 as an input to reconciliation. In 3.0, it's been completely superseded by the similarity configurations, and has been removed.

Upgrading from version 1.3 to version 2.0

New features

No Cygwin support

Support for Cygwin has been removed, because Python in Cygwin does not support sqlite, and sqlite is required for the MAT workspaces in 2.0. Migrate to Windows native.

Python 2.6 or later required

MAT 2.0 makes extensive use of JSON and sqlite, which are best supported in Python 2.6 or 2.7. It also relies on Python's "with" statement, which is enabled by default starting in 2.6.

Task.xml schema has changed

Because MAT now explicitly defines the annotations and well-formedness conditions for attributes separately from its display information, the task.xml file has been reorganized. You can use the MATUpdateTaskXML tool to update your task.xml file automatically.

All models must be rebuilt (new version of jCarafe)

The version of jCarafe which is delivered with MAT 2.0 is 0.9.8.5.b-06, which has a different model structure than the version delivered with 1.3. You must rebuild all your models, either using MATModelBuilder (in file mode) or the "modelbuild" operation of MATWorkspaceEngine (in workspace mode).

UI has been completely reorganized, with a new URL

The 1.3 UI used a desktop-in-a-browser metaphor, which raised a number of issues, including poor use of screen real estate. In 2.0, we've completely reorganized the UI, and changed the URL.

mat_controller.sh is replaced by the --spawn_tabbed_terminal option of MATWeb

In previous releases, you really didn't have the option to pass any command-line options to the MATWeb server running under the tabbed terminal. As the command-line options to MATWeb expanded, and became more important, this turned out to be a bad idea. As a result, we've now reorganized the tabbed terminal startup so that it's part of MATWeb. The mat_controller.sh application is gone. The Windows mat_controller.bat script is still present, but it simply invokes MATWeb with the --spawn_tabbed_terminal option.

Workspaces have been completely reorganized

We have completely reorganized the internal structure of workspaces for 2.0. These new workspaces are more powerful and impose fewer requirements on the user. Your MAT 1.3 workspaces cannot be used with MAT 2.0 without modification. We've provided an upgrade tool which will allow you to convert your MAT 1.3 workspaces to MAT 2.0.

The new workspaces feature many fewer folders; a SQLite database which manages the document state information; real transaction and file locking; document assignment, potentially to multiple annotators; extensive logging capabilities; and infrastructure for future capabilities like reconciliation and complex reconciliation workflows, prioritization queues, and segment-by-segment annotation.

As a result of this change, it's no longer possible to run an experiment against a workspace by pointing to, e.g., the "completed" folder. So as part of this change, there's now special support for running experiments against workspaces, both from MATWorkspaceEngine and MATExperimentEngine.

Scorer output ranges have changed

In version 1.3, recall, precision and f-measure were all scaled from 0 to 100. In 2.0, they're scaled from 0 to 1.
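If you have scripts that compare scores across releases, the 1.3 values need to be rescaled before comparison. A minimal sketch, with a hypothetical function name:

```python
def rescale_score(score_1_3):
    """Convert a 1.3-style score (0 to 100) to the 2.0 scale (0 to 1)."""
    if not 0 <= score_1_3 <= 100:
        raise ValueError("expected a score between 0 and 100")
    return score_1_3 / 100.0
```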

CSV spreadsheet management in MATScore and MATExperimentEngine has changed

MATScore and MATExperimentEngine have long supported writing one of three CSV file formats (Excel formulas, OpenOffice formulas, and no formulas). In 2.0, you can now write multiple formats in the same run, and the name of each CSV file clearly indicates the formula type. As a result, the --no_csv_formulas and --oo_separator command-line options have been removed, and replaced with --csv_formula_output.

MATScore --tag_span_details renamed

Because the scorer now provides mismatch details for all conditions, this flag has been renamed to --tag_output_mismatch_details.

MATScore spreadsheet output has changed

Due to enhancements to the scorer, some of the columns in the output spreadsheets have been renamed or moved, and others have a slightly different interpretation. Full details here.

Command-line options to MATWorkspaceEngine have changed

In previous releases, we deprecated, but retained, the "operate" operation in MATWorkspaceEngine. This operation has finally been removed in 2.0. If you had still been doing something like this:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory operate core modelbuild

you should now do this:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory modelbuild core

See the workspace documentation for more details.
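If you have wrapper scripts that build MATWorkspaceEngine command lines, the reordering shown above can be applied mechanically. The sketch below is hypothetical and generalizes from the single example above: it assumes that the two arguments following "operate" are always a folder and an operation, in that order.

```python
def upgrade_workspace_args(args):
    """Rewrite [..., "operate", folder, operation] as [..., operation, folder]."""
    if "operate" not in args:
        return list(args)
    i = args.index("operate")
    if len(args) < i + 3:
        raise ValueError("expected 'operate <folder> <operation>'")
    folder, operation = args[i + 1], args[i + 2]
    return list(args[:i]) + [operation, folder] + list(args[i + 3:])


new_args = upgrade_workspace_args(["mydirectory", "operate", "core", "modelbuild"])
```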

Upgrading from version 1.2 to version 1.3

Web server security has been improved

We now provide a separate document on Web server security as it pertains to workspace access. There are a number of new options to MATWeb to support improved security. The most visible effect is that you can restrict access to workspaces from the MAT UI by using the --workspace_container_directory option when you start up the MAT Web server.

Attribute to control right-to-left text display has moved

In version 1.2, the text_right_to_left attribute lived on the workflow element in task.xml; we anticipated that different workflows might be used for different languages within the same task. Since then, we've realized that the task is going to be the appropriate level of encapsulation for language differences for the foreseeable future. Furthermore, the current implementation of right-to-left encoding did not work appropriately with workspaces. Accordingly, we've moved this attribute to the web_customization element, and it is now global to tasks.

Corpus size iteration has changed

The experiment engine has now been extended with general-purpose iterators for sets of values and for value increments. So it's now possible, for instance, to vary the number of model iterations from 20 to 100 by increments of 10 without having to write a separate model set specification for each possible value. These iterators can be combined, in which case you'll get the cross-product of the possible value settings, or you can define your own iterators to get more sophisticated behavior (e.g., iterating over pairs of attribute-value sets). For the user, this means that a couple of attributes have been removed from the experiment engine, and a new set of elements and attributes has been added.
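To make the cross-product behavior concrete, here is a small Python illustration. It is not the experiment engine's implementation; the helper name and the second set of values are made up for the example.

```python
from itertools import product


def increments(start, stop, step):
    """Values from start to stop inclusive, in steps of step."""
    return list(range(start, stop + 1, step))


max_iterations = increments(20, 100, 10)  # 20, 30, ..., 100
corpus_sizes = [4, 8]                     # hypothetical second iterator

# Combining two iterators yields every pairing of their values.
settings = list(product(max_iterations, corpus_sizes))
```

Nine iteration values combined with two corpus sizes give eighteen settings, each of which would otherwise have needed its own model set specification.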

In version 1.2, all you could iterate on was corpus size. The mechanism for this iteration has now changed. In version 1.2, this is what you'd do:

[...]
<model_sets dir="model_sets">
  <build_settings training_increment="4"
                  truncate_to_increment="yes"/>
  <model_set name="test">
    <training_corpus corpus="test" partition="train"/>
  </model_set>
</model_sets>
[...]

In version 1.3, it looks like this instead:

[...]
<model_sets dir="model_sets">
  <corpus_settings>
    <iterator type="corpus_size" increment="4"/>
  </corpus_settings>
  <model_set name="test">
    <training_corpus corpus="test" partition="train"/>
  </model_set>
</model_sets>
[...]

You can see that the size processing has been removed from the <build_settings> and added to a new <corpus_settings> element, which contains an instance of the new <iterator> element to specify the type of the iteration. See the documentation and examples for the experiment engine for more details. Note that in version 1.2, you had to specify explicitly that the iteration ends on an increment exactly; in 1.3 this is the default, and to force the final corpus size to be used, you'll need the force_last attribute:

[...]
<model_sets dir="model_sets">
  <corpus_settings>
    <iterator type="corpus_size" increment="4" force_last="yes"/>
  </corpus_settings>
  <model_set name="test">
    <training_corpus corpus="test" partition="train"/>
  </model_set>
</model_sets>
[...]
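The effect of force_last can be sketched in a few lines of Python. This reflects our reading of the description above, not the engine's actual code, and the function name is hypothetical.

```python
def corpus_sizes(total, increment, force_last=False):
    """List the training corpus sizes visited by a corpus_size iteration."""
    # By default, stop on the last exact multiple of the increment.
    sizes = list(range(increment, total + 1, increment))
    # With force_last, also include the full corpus if it was not reached.
    if force_last and (not sizes or sizes[-1] != total):
        sizes.append(total)
    return sizes


default_sizes = corpus_sizes(10, 4)
forced_sizes = corpus_sizes(10, 4, force_last=True)
```

With a 10-document training corpus and an increment of 4, the default visits sizes 4 and 8, while force_last adds a final model built from all 10 documents.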

Experiment spreadsheet columns have been expanded

The experiment engine output spreadsheets have been slightly expanded to include information about the run and model "families" in addition to the actual run and model. This change follows from the introduction of general iterators described above. See the documentation on MATExperimentEngine for details.

Experiment directory structure has changed

In order to support the iterators in the experiment engine, we've reorganized the structure of the experiment directory somewhat. See the documentation on MATExperimentEngine for details.

Upgrading from version 1.1 to version 1.2

New native Windows port

It is now possible to run MAT in Windows without Cygwin installed.

Single distribution bundle for all platforms

Unlike previous versions, there is a single distribution bundle for MAT 1.2 for all supported platforms. For compatibility with Windows, this bundle is now a zip file.

New tabbed terminal for Windows

If you use mat_controller.sh or mat_controller.bat under Windows, you'll find that there's a new tabbed terminal tool we're using, which has the advantage of not requiring Cygwin.

New version of Terminator.app for MacOS X 10.6

If you're using mat_controller.sh under MacOS X, and you intend to install 10.6, note that the previous version of Terminator.app, which supports the tabbed terminal behavior in mat_controller.sh, will not work in 10.6; you must install the newer version provided with MAT 1.2.

Tokenizer has changed

In version 1.2, the original OCaml tokenizer and Carafe trainer/tagger have been replaced by the Java reimplementations. There are a number of important changes that are required as a result. Among other things, the Java tokenizer produces slightly different token boundaries than the original OCaml tokenizer. This is problematic because the entire basis of most annotation systems, including MAT, is the subdivision into words (tokens). In order to have optimal performance, the tokenization of documents which are to be automatically tagged should match the tokenization of the documents which were used to create the tagger model. This means that in order to migrate from version 1.1 to version 1.2, among other things, you must retokenize your documents and update any references to the OCaml tokenizer.

First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back up your data before you run this utility.

Next, if you refer to a tokenization step implementation in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTokenizationStep to MAT.JavaCarafe.CarafeTokenizationStep. You may also need to specify the heap_size attribute on the relevant tokenization <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation).

Trainer/tagger has changed

In version 1.2, the original OCaml tokenizer and Carafe trainer/tagger have been replaced by the Java reimplementations. There are a number of important changes that are required as a result. Among other things, the model format for the Java engine is completely different than the model format for the original OCaml tokenizer. This means that you must rebuild all your models, and update any references to the OCaml trainer/tagger.

First, retokenize your documents using MATRetokenize, as described above, and update your tokenization steps.

Next, update your tagger and trainer settings in task.xml according to the documentation provided for the Carafe engine.

Next, if you refer to a tagging step in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTagStep to MAT.JavaCarafe.CarafeTagStep. You may also need to specify the heap_size attribute on the relevant tag <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation). Similarly, if you have a <model_build_settings> entry, you must change all occurrences of MAT.CarafeModelBuilder.CarafeModelBuilder to MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the heap_size attribute as well. (Note below that you must also change the syntax of <model_build_settings>.)
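The three class renames mentioned in this section and the previous one can be applied to the text of a task.xml file with a simple substitution. The following sketch uses a hypothetical function name and only handles the renames; it does not add heap_size or change the syntax of <model_build_settings>.

```python
# Old class name -> new class name, as listed in these upgrade notes.
RENAMED_CLASSES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def update_class_references(task_xml_text):
    """Replace every old class name in the given text with its new name."""
    for old, new in RENAMED_CLASSES.items():
        task_xml_text = task_xml_text.replace(old, new)
    return task_xml_text


# Hypothetical fragment, used only to show the substitution.
before = 'class="MAT.PluginMgr.CarafeTagStep" class="MAT.CarafeModelBuilder.CarafeModelBuilder"'
after = update_class_references(before)
```

Running the function a second time leaves the text unchanged, so it is safe to apply to a file that has already been partly updated. As with MATRetokenize, back up your task.xml before rewriting it.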

Note that for the tagger, the prior_adjst attribute has been renamed to prior_adjust. For the trainer, the engine attribute has been eliminated, and the feature_set attribute as well; there's now a new feature_spec attribute which refers to a file in which you can describe your feature set, if you don't want to use the default feature set. Also, the psa_iterations flag has been removed, because the Carafe trainer now has more options:

psa_iterations="6"

becomes

training_method="psa" max_iterations="6"

Because PSA no longer requires random segments, the no_random_psa_segments flag has been removed.
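The two PSA-related changes above can be expressed as a rewrite over a dictionary of trainer attributes. This is a sketch with a hypothetical function name; it leaves all other attributes untouched, including engine and feature_set, which have no automatic replacement.

```python
def upgrade_trainer_settings(attrs):
    """Return a copy of 1.1 trainer attributes with the PSA settings rewritten for 1.2."""
    updated = dict(attrs)
    if "psa_iterations" in updated:
        # psa_iterations="N" becomes training_method="psa" max_iterations="N".
        updated["training_method"] = "psa"
        updated["max_iterations"] = updated.pop("psa_iterations")
    # The no_random_psa_segments flag no longer exists.
    updated.pop("no_random_psa_segments", None)
    return updated


new_settings = upgrade_trainer_settings(
    {"psa_iterations": "6", "no_random_psa_segments": "yes"}
)
```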

Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.

Internals of experiment directories have changed

In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of specifying partitions in experiments, we've changed the way partitions are specified in the experiment XML files. We compare the relevant files below:

Version 1.1:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition split_fraction=".2" ctype="split"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test" corpus="test"/>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test" corpus="test"/>
  </runs>
</experiment>

Version 1.2:

<experiment task='Named Entity'>
  <corpora dir="corpora">
    <partition name="train" fraction=".8"/>
    <partition name="test" fraction=".2"/>
    <corpus name="test">
      <pattern>*.json</pattern>
    </corpus>
  </corpora>
  <model_sets dir="model_sets">
    <model_set name="test">
      <training_corpus corpus="test" partition="train"/>
    </model_set>
  </model_sets>
  <runs dir="runs">
    <run_settings>
      <args steps="zone,tokenize,tag" workflow="Demo"/>
    </run_settings>
    <run name="test" model="test">
      <test_corpus corpus="test" partition="test"/>
    </run>
  </runs>
</experiment>

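Note the following changes between the two examples: the single <partition> element with split_fraction and ctype attributes has been replaced by one named <partition> element per partition, each with a fraction attribute; the corpus attribute of <model_set> has been replaced by a <training_corpus> child element which names both a corpus and a partition; and the corpus attribute of <run> has been replaced by a <test_corpus> child element which likewise names a corpus and a partition.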
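The differences between the two experiment files above can be sketched as a Python script built on the standard xml.etree.ElementTree module. This is a rough illustration, not a tool that MAT provides: the function name migrate_experiment is hypothetical, and the sketch assumes that split_fraction in version 1.1 was the held-out (test) share of the corpus, as the example pair suggests.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def migrate_experiment(root):
    """Rewrite a 1.1-style <experiment> tree in place to the 1.2 layout.

    Hypothetical helper for illustration; MAT does not ship this function.
    """
    corpora = root.find("corpora")
    old = corpora.find("partition")
    if old is not None and "split_fraction" in old.attrib:
        # Assumption: split_fraction is the test share; the rest is training.
        test_fraction = Decimal(old.get("split_fraction"))
        position = list(corpora).index(old)
        corpora.remove(old)
        named = [("train", 1 - test_fraction), ("test", test_fraction)]
        for offset, (name, fraction) in enumerate(named):
            part = ET.Element("partition", {"name": name, "fraction": str(fraction)})
            corpora.insert(position + offset, part)
    # <model_set corpus="X"/> gains a <training_corpus> child instead.
    for model_set in list(root.iter("model_set")):
        corpus = model_set.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(model_set, "training_corpus",
                          {"corpus": corpus, "partition": "train"})
    # <run corpus="X"/> gains a <test_corpus> child instead.
    for run in list(root.iter("run")):
        corpus = run.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(run, "test_corpus",
                          {"corpus": corpus, "partition": "test"})
    return root


sample = """<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test"><pattern>*.json</pattern></corpus>
</corpora>
<model_sets dir="model_sets"><model_set name="test" corpus="test"/></model_sets>
<runs dir="runs"><run name="test" model="test" corpus="test"/></runs>
</experiment>"""
print(ET.tostring(migrate_experiment(ET.fromstring(sample)), encoding="unicode"))
```

Fractions are handled with Decimal so that ".2" yields a training share of exactly 0.8 rather than a floating-point approximation. Check the result by hand before using it in a real experiment.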

Settings in task.xml have changed

In order to clarify how task settings are handled in MAT, a number of changes have been made to the task.xml file syntax.

First, the <step> element of <step_implementations> no longer accepts arbitrary attributes. If you made use of this feature to pass settings to the initialization methods of workflow steps, you must now use the <create_settings> child element. We doubt that anyone has made use of this feature.

Second, the <step> element of <workflow> no longer accepts arbitrary attributes. If you made use of this feature to pass settings to workflow steps, you must now use the <create_settings>, <ui_settings>, or <run_settings> child elements. The most likely situation where this might arise is in passing defaults to the run methods of steps. For instance, if you used this feature to increase the Java heap size for Java Carafe, your task.xml file would have to be revised as follows.

Version 1.1:

  ...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
...
</workflows>
...

Version 1.2:

  ...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag">
<run_settings heap_size="2G"/>
</step>
</workflow>
...
</workflows>
...
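The heap size revision above can be sketched in Python with the standard xml.etree.ElementTree module. The function name move_step_attrs_to_run_settings is hypothetical, and the sketch assumes that every extra attribute on a workflow step is a run setting; attributes that are really creation or UI settings would have to be moved to <create_settings> or <ui_settings> by hand.

```python
import xml.etree.ElementTree as ET


def move_step_attrs_to_run_settings(root):
    """Move extra attributes of workflow <step> elements into <run_settings>.

    Hypothetical helper for illustration; MAT does not ship this function.
    Assumption: every attribute other than "name" is a run setting.
    """
    for workflow in list(root.iter("workflow")):
        for step in workflow.findall("step"):
            extras = {k: v for k, v in step.attrib.items() if k != "name"}
            if not extras:
                continue
            for key in extras:
                del step.attrib[key]
            ET.SubElement(step, "run_settings", extras)
    return root


sample = """<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
</workflows>"""
print(ET.tostring(move_step_attrs_to_run_settings(ET.fromstring(sample)),
                  encoding="unicode"))
```

Only <step> elements that are direct children of <workflow> are touched, so the <step> elements under <step_implementations> are left alone.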

Third, the way settings are specified for model configurations has changed. The name and class for the configuration are now separated from the settings which are passed to the model builder, as follows.

Version 1.1:

  ...
<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
training_method="psa" max_iterations="6"/>
...

Version 1.2:

  ...
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
...
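The restructuring of the model configuration shown above can be sketched in Python as follows. The function name to_model_config is hypothetical; the sketch assumes, following the description above, that only the name and class attributes stay on <model_config>, and that everything else is a build setting.

```python
import xml.etree.ElementTree as ET


def to_model_config(old):
    """Build a 1.2-style <model_config> from a 1.1 <model_build_settings>.

    Hypothetical helper for illustration; MAT does not ship this function.
    """
    attrs = dict(old.attrib)
    # name and class identify the configuration itself.
    config_attrs = {k: attrs.pop(k) for k in ("name", "class") if k in attrs}
    config = ET.Element("model_config", config_attrs)
    # Everything that remains is passed to the model builder.
    ET.SubElement(config, "build_settings", attrs)
    return config


old = ET.fromstring(
    '<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder" '
    'training_method="psa" max_iterations="6"/>'
)
print(ET.tostring(to_model_config(old), encoding="unicode"))
```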

Finally, the <workflow> element no longer accepts arbitrary settings; these settings must be passed using the <ui_settings> child element. No task appears to use this option yet, so this shouldn't affect anyone.

Upgrading from version 1.0 to version 1.1

Internals of experiment directories have changed

In order to support a more flexible way of invoking the MAT engine in experiments, the way the configuration of experiments is cached has changed in version 1.1. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.0 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of invoking the MAT engine in experiments, we've changed the way corpus preprocessing and test run processing are specified. In version 1.0, the MAT engine was called as a command-line tool, and the options were specified as a command line; in version 1.1, the options are specified as XML attribute-value pairs. We compare the relevant experiment XML blocks below:

Version 1.0:

  <corpora dir="corpora">
<prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
[...]
</corpora>

<runs dir="runs">
<run_settings>
<args>--steps zone,tokenize,tag --workflow Demo</args>
</run_settings>
[...]
</runs>

Version 1.1:

  <corpora dir="corpora">
<prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
[...]
</corpora>

<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
[...]
</runs>
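The conversion from a command line to attribute-value pairs can be sketched in Python using the standard shlex module, which handles the shell-style quoting seen in the version 1.0 example. The function name options_to_element is hypothetical, and the sketch assumes that every option has the form "--name value"; options without a value are not handled.

```python
import shlex
import xml.etree.ElementTree as ET


def options_to_element(tag, command_line):
    """Turn a 1.0-style option string into a 1.1-style element with attributes.

    Hypothetical helper for illustration; MAT does not ship this function.
    Assumption: every option is written as "--name value".
    """
    tokens = shlex.split(command_line)
    if len(tokens) % 2 != 0:
        raise ValueError("expected '--name value' pairs: %r" % command_line)
    attrs = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--"):
            raise ValueError("not an option: %r" % flag)
        attrs[flag[2:]] = value
    return ET.Element(tag, attrs)


prep = options_to_element(
    "prep",
    "--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'",
)
print(ET.tostring(prep, encoding="unicode"))
```

The same call with the tag "args" covers the <args> element inside <run_settings>.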

New training engine configuration in task.xml

Version 1.1 adds the ability to define different training engines. Because of this change, if you've defined your own task and you specified model build settings in your task.xml file, you must add a class attribute to the model_build_settings element. This attribute is not optional, and there is no default. If you're using the default Carafe engine, the value you should use for this attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the following example:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>

New folder in workspaces

Version 1.1 adds the ability to import into your workspaces MAT JSON documents which haven't yet been processed (as well as documents in other annotation formats, like XML inline). Because of this change, if you have a workspace, you must add a directory to it. This directory is expected by the MAT workspace engine. For each workspace directory, do this:

% mkdir <workspace_dir>/folders/rich_incoming
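If you have many workspaces, the same step can be sketched as a short Python loop. The function name add_rich_incoming is hypothetical, and the sketch assumes that each path you pass is a workspace directory which already contains a folders subdirectory; it refuses to touch anything else.

```python
import os
import tempfile


def add_rich_incoming(workspace_dirs):
    """Create folders/rich_incoming in each workspace; return the paths created.

    Hypothetical helper for illustration; MAT does not ship this function.
    """
    created = []
    for workspace in workspace_dirs:
        folders = os.path.join(workspace, "folders")
        if not os.path.isdir(folders):
            raise ValueError("no folders directory in %r" % workspace)
        target = os.path.join(folders, "rich_incoming")
        if not os.path.isdir(target):
            os.mkdir(target)
            created.append(target)
    return created


# Demonstration on a throwaway directory that imitates a workspace layout.
with tempfile.TemporaryDirectory() as workspace:
    os.mkdir(os.path.join(workspace, "folders"))
    print(add_rich_incoming([workspace]))
```

Running the function a second time on the same workspace creates nothing, so it is safe to repeat.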

New command line option restriction for MATModelBuilder

In version 1.1, it's possible to have multiple model build configurations in your task.xml file. In order to ensure that the correct configuration adds the appropriate command line options to the MATModelBuilder executable, it was necessary to introduce a new restriction on the --task option for MATModelBuilder: if it appears, it must now be the first command-line option. In other words, the following will now raise an error:

% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
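If you generate MATModelBuilder command lines from a script, the reordering that the new restriction requires can be sketched in Python. The function name task_first is hypothetical, and the sketch assumes that --task is followed by exactly one value, as in the command above.

```python
def task_first(args):
    """Return a copy of args with "--task VALUE" moved to the front, if present.

    Hypothetical helper for illustration; MAT does not ship this function.
    Assumption: --task always takes exactly one value.
    """
    args = list(args)
    if "--task" not in args:
        return args
    index = args.index("--task")
    if index + 1 >= len(args):
        raise ValueError("--task needs a value")
    pair = args[index:index + 2]
    return pair + args[:index] + args[index + 2:]


options = [
    "--input_files", "/path/to/my/docs/1[0-9][0-9].json",
    "--input_dir", "/path/to/my/other/docs",
    "--task", "Named Entity",
    "--lexicon_dir", "/path/to/my/lexicon/",
    "--save_as_default_model",
]
print(task_first(options))
```

The reordered list starts with "--task" and "Named Entity" and leaves the remaining options in their original order, which is the form the restriction asks for.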

Change to how the default model is specified in task.xml

In version 1.0, the default model was defined within the model build settings. In version 1.1, because of the presence of multiple model build configurations, we've separated the specification of the default model from the model build settings in task.xml.

Version 1.0:

  <model_build_settings engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6" default_model="default_model"/>

Version 1.1:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
<default_model>default_model</default_model>
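The two task.xml changes for version 1.1 (the required class attribute and the separate default model element) can be sketched together in Python. The function name split_default_model is hypothetical; the sketch fills in the default Carafe class named earlier in these notes only when no class attribute is present.

```python
import xml.etree.ElementTree as ET


def split_default_model(settings,
                        builder_class="MAT.CarafeModelBuilder.CarafeModelBuilder"):
    """Return the 1.1-style elements for a 1.0 <model_build_settings> element.

    Hypothetical helper for illustration; MAT does not ship this function.
    The element is updated in place and returned first; a <default_model>
    element follows if the old element had a default_model attribute.
    """
    # Version 1.1 requires a class attribute; keep an existing one if present.
    settings.attrib.setdefault("class", builder_class)
    result = [settings]
    name = settings.attrib.pop("default_model", None)
    if name is not None:
        default_model = ET.Element("default_model")
        default_model.text = name
        result.append(default_model)
    return result


old = ET.fromstring(
    '<model_build_settings engine="anonTrain.native" feature_set="ANON-1" '
    'psa_iterations="6" default_model="default_model"/>'
)
for element in split_default_model(old):
    print(ET.tostring(element, encoding="unicode"))
```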