Upgrade Notes

If you've received a previous version of MAT, this page contains instructions on how to upgrade to the new version.

Upgrading from version 3.0 to version 3.1

Version 3.1 is completely backward compatible with version 3.0. Plus, there are a number of new features you can enjoy.

New features

Upgrading from version 2.0 to version 3.0

New features

Completely reorganized task.xml

In order to support the expanded task configuration features, we've completely reorganized the task.xml file. We've introduced the concept of an engine; completely reworked the way workspaces are configured and built; and localized optional and obligatory pretagging in the definition of steps. You can read more about the new task organization here and here. The MATUpdateTaskXML tool is designed to do most, if not all, of this work for you (and should tell you what it can't do as it tries to do it). Your first step in updating to 3.0 should be to run this tool before installing your (updated) task.

Workflows are now not undoable by default

One of the confounding aspects of the 2.0 implementation of workflows and steps was that the task required a global "undo" order, in order to ensure that when steps were undone in a workflow, all appropriate steps in the task were undone (e.g., if you undid tag and zone in a workflow which doesn't contain tokenization, and some workflow in the task contained tokenization between zone and tag, tokenization was undone as well). This global undo order was impossible to maintain in 3.0, and as a result, it has been abandoned. If you undo steps in a workflow, only those steps will be undone, and as a result, your documents can end up in unusual states (e.g., tokenized but not zoned). In order to compensate for this issue, in 3.0, workflows, by default, are not undoable; the "retreat" buttons in the UI will not be present, for instance. You can specify workflows as undoable using the new "undoable" attribute of the <workflow> element in your task.xml file, but we encourage you to use it sparingly; you should only enable this feature for workflows which support all your tagging steps (content and otherwise).

Workspace database and structure changes

Workspaces now explicitly record the workflow they apply and the language they're supporting. In addition, workspaces have a new "review" folder, for human review, and the contents and metadata associated with reconciliation folders are completely different. As a result, your 2.0 workspaces must be updated to 3.0 (after you've updated your task) using the MATUpdateWorkspace2To3 tool.

Demo has been removed

In 2.0, MAT had a demo configuration capability, which we decided we couldn't afford to maintain. It's been removed in 3.0.

UI file mode advancement buttons have changed, and the "Mark gold" button is gone

In 2.0, file mode featured two buttons: a forward arrow to complete the current step, and a backward arrow to undo the current step. In 3.0, we've clarified and sorted out what these operations do, and how they interact with marking gold. MAT now features four types of annotation steps, which are specified in your task.xml file:

The available advance/retreat buttons vary depending on which step you're currently in. A hand icon refers to hand annotation, a gear icon refers to automated annotation, a right arrow refers to marking the step gold, and a left arrow indicates retreating; coupled with a hand icon, the left arrow indicates retreating into the most recent hand annotation phase, rather than undoing the current or previous step. You can find more details here.

The meaning of SEGMENT status has changed, and SEGMENT annotations have changed

In 2.0, MAT introduced SEGMENT annotations, an administrative annotation type which tracks annotation progress. These SEGMENT annotations referred to the document in general; e.g., a "human gold" status indicated that the document was marked gold. In 3.0, these statuses refer to particular annotation sets; so when you mark a document gold in an annotation step (and you can have multiple hand-annotatable steps), you're marking gold the sets associated with the step. In other words, MAT is now tracking (properly, we think) the annotation status of your annotation sets. To support this, each SEGMENT annotation has a "set" attribute, indicating what annotation set it refers to.

Administrative information in MAT documents has changed

The change to SEGMENT annotations is only one way the administrative information in MAT-JSON documents has changed. Another way is that MAT-JSON documents now record their overall progress in terms of the annotation sets that have been added, rather than the workflow steps that have been applied. When 2.0 MAT-JSON documents are read into a 3.0 tool, all this administrative information is automatically updated. As a result, documents saved in 3.0 cannot be used in 2.0.

MATEngine options have changed

As part of the various 3.0 changes, the options to MATEngine have changed slightly:

As part of the change in administrative information, 2.0 administrative information that can't be updated when the document is read is discarded. If, for instance, you've edited your task to split up your content annotations into multiple sets (say, span annotations and relation annotations), the SEGMENT statuses can't be updated consistently, and "human gold" or "reconciled" statuses will be discarded. You can use the new --mark_gold and --mark_reconciled options of MATEngine to fix this.

--tagger_local and --tagger_model have changed

Because workflows can now have multiple content annotation steps, the --tagger_local and --tagger_model flags have been replaced. These flags can now be specified as --<step_name>_local and --<step_name>_model, where <step_name> is the name of the step. E.g., if one of your tag steps is named "carafe_tag", these flags would be --carafe_tag_local and --carafe_tag_model. In the context of declaring <run_settings> in the context of a <step> in your <workflow> element in task.xml, these flags should be referenced as "local" and "model". The rule of thumb is: in the context of the engines, a prefix is required; in the context of a step, it is forbidden.

All jCarafe option names are now prefixed

Because you can now use jCarafe as a trainable tagger in multiple steps in your workflow, (almost) all the option names associated with the jCarafe tagger are now prefixed in the same way that "local" and "model" are. E.g., if you want to change the recall/precision balance for the carafe_tag step, you must now use --carafe_tag_prior_adjust instead of --prior_adjust in the context of the workflow or engine, or refer simply to "prior_adjust" in the context of declaring a workflow step.
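
As a rough illustration of the prefixing rule described in the last two sections, the following Python 3 sketch rewrites a 2.0-style engine argument list for one named step. It is not part of MAT: the function name and the table of old flags are assumptions made for illustration, and the table lists only the flags named above.

```python
# Hypothetical helper, not part of MAT. It covers only the flags named in the
# text; other jCarafe options would need to be added to the table by hand.
OLD_TO_UNPREFIXED = {
    "--tagger_local": "local",
    "--tagger_model": "model",
    "--prior_adjust": "prior_adjust",
}


def prefix_engine_args(args, step_name):
    """Rewrite 2.0-style flags as 3.0-style --<step_name>_<option> flags."""
    rewritten = []
    for arg in args:
        # Handle both "--flag value" and "--flag=value" spellings.
        flag, sep, value = arg.partition("=")
        if flag in OLD_TO_UNPREFIXED:
            flag = "--%s_%s" % (step_name, OLD_TO_UNPREFIXED[flag])
        rewritten.append(flag + sep + value)
    return rewritten
```

Inside a <run_settings> declaration for a step, the unprefixed names in the right-hand column of the table are the ones to use.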

Workspace operations have changed

Before MAT 2.0, workspaces had a "prep" operation, an "import" operation and an "autotag" operation. In 2.0, we removed the "prep" operation and folded it into the "import" operation. In 3.0, any human-annotatable workflow can serve as the basis for a workspace, and the goal of the workspace is to advance automatically to the next human-annotatable point. As a result:

The workspaces now also feature a number of new operations to manage reconciliation and review, which you can learn about here.

Experiment XML and MATExperimentEngine have changed

In order to support the various new workflow and workspace features, the following elements of the experiment XML have changed:

Unless you've used the model_class or <workspace_corpora> features in your experiment, you should not notice these changes in moving from 2.0 to 3.0.

Annotation attribute filling in the UI has been improved

When you fill the value of an annotation attribute in the UI, you have new, more streamlined options in 3.0. First, we've introduced the idea of an "active" annotation editor, if multiple annotation editor windows are open; from the annotation popup menu, you can now add annotations you click on as attribute values in the active annotation editor without returning to the editor window. Second, if your attribute is a set or a list, you can add multiple values without re-enabling the attribute for filling for each value.

Placement of spanless sidebar icons has improved

In 2.0, the default location for spanless annotations without any annotation-valued attributes of their own was at the top of the document. This placement turned out to be problematic for certain spanless annotations (e.g., those representing implicit argument fillers). In 3.0, spanless annotations which have no implicit span information, but are attributes of elements with implicit span information, are positioned next to the elements which point to them.

UI logging output has been changed

A few of the log entry names have been changed, and a number of obsolete entries have been removed.

distinguishing_attribute_for_equality has been removed

The distinguishing_attribute_for_equality attribute in the task.xml file was used pre-2.0 as an input to scoring, and in 2.0 as an input to reconciliation. In 3.0, it's been completely superseded by the similarity configurations, and has been removed.

Upgrading from version 1.3 to version 2.0

New features

No Cygwin support

Support for Cygwin has been removed, because Python in Cygwin does not support sqlite, and sqlite is required for the MAT workspaces in 2.0. Migrate to the native Windows port instead.

Python 2.6 or later required

MAT 2.0 makes extensive use of JSON and sqlite, which are best supported in Python 2.6 or 2.7. It also relies on Python's "with" statement, which is first available without a __future__ import in 2.6.
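
If you want to confirm that an interpreter meets these requirements before installing, a minimal check along the following lines may help. It is a sketch rather than a MAT tool: the function name is invented for illustration, and it tests only the lower version bound and the presence of the json and sqlite3 modules.

```python
import sys


def check_python_for_mat2(version_info=sys.version_info):
    """Return a list of problems that would prevent running MAT 2.0."""
    problems = []
    # Only the lower bound is checked here.
    if tuple(version_info[:2]) < (2, 6):
        problems.append("Python 2.6 or later is required")
    # Both modules ship with Python 2.6+, but sqlite3 can be missing from
    # some builds (as in the Cygwin case described above).
    for module_name in ("json", "sqlite3"):
        try:
            __import__(module_name)
        except ImportError:
            problems.append("module %s is not available" % module_name)
    return problems
```

An empty list means that none of the checked problems were found.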

Task.xml schema has changed

Because MAT now explicitly defines the annotations and well-formedness conditions for attributes separately from its display information, the task.xml file has been reorganized. You can use the MATUpdateTaskXML tool to update your task.xml file automatically.

All models must be rebuilt (new version of jCarafe)

The version of jCarafe which is delivered with MAT 2.0 is 0.9.8.5.b-06, which has a different model structure than the version delivered with 1.3. You must rebuild all your models, either using MATModelBuilder (in file mode) or the "modelbuild" operation of MATWorkspaceEngine (in workspace mode).

UI has been completely reorganized, with a new URL

The 1.3 UI used a desktop-in-a-browser metaphor, which raised a number of issues, including poor use of screen real estate. In 2.0, we've completely reorganized the UI, and changed the URL.

mat_controller.sh is replaced by the --spawn_tabbed_terminal option of MATWeb

In previous releases, you really didn't have the option to pass any command-line options to the MATWeb server running under the tabbed terminal. As the command-line options to MATWeb expanded, and became more important, this turned out to be a bad idea. As a result, we've now reorganized the tabbed terminal startup so that it's part of MATWeb. The mat_controller.sh application is gone. The Windows mat_controller.bat script is still present, but it simply invokes MATWeb with the --spawn_tabbed_terminal option.

Workspaces have been completely reorganized

We have completely reorganized the internal structure of workspaces for 2.0. These new workspaces are more powerful and impose fewer requirements on the user. Your MAT 1.3 workspaces cannot be used with MAT 2.0 without modification. We've provided an upgrade tool which will allow you to convert your MAT 1.3 workspaces to MAT 2.0.

The new workspaces feature many fewer folders; a SQLite database which manages the document state information; real transaction and file locking; document assignment, potentially to multiple annotators; extensive logging capabilities; and infrastructure for future capabilities like reconciliation and complex reconciliation workflows, prioritization queues, and segment-by-segment annotation.

As a result of this change, it's no longer possible to run an experiment against a workspace by pointing to, e.g., the "completed" folder. So as part of this change, there's now special support for running experiments against workspaces, both from MATWorkspaceEngine and MATExperimentEngine.

Scorer output ranges have changed

In version 1.3, recall, precision and f-measure were all scaled from 0 to 100. In 2.0, they're scaled from 0 to 1.
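
If you need to compare scores saved under 1.3 with scores produced by 2.0, the older values have to be divided by 100 first. The Python 3 sketch below shows one way to do that for a single spreadsheet row held as a dictionary; the column names are placeholders, not the actual MATScore column headers.

```python
def rescale_1_3_scores(row, score_columns=("precision", "recall", "fmeasure")):
    """Return a copy of row with 0-100 scores divided down to the 0-1 range.

    The default column names are hypothetical; pass the headers your own
    spreadsheets use.
    """
    rescaled = dict(row)
    for column in score_columns:
        if column in rescaled:
            rescaled[column] = float(rescaled[column]) / 100.0
    return rescaled
```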

CSV spreadsheet management in MATScore and MATExperimentEngine has changed

MATScore and MATExperimentEngine have long supported writing one of three CSV file formats (Excel formulas, OpenOffice formulas, and no formulas). In 2.0, you can now write multiple formats in the same run, and the name of each CSV file clearly indicates the formula type. As a result, the --no_csv_formulas and --oo_separator command-line options have been removed, and replaced with --csv_formula_output.

MATScore --tag_span_details renamed

Because the scorer now provides mismatch details for all conditions, this flag has been renamed to --tag_output_mismatch_details.

MATScore spreadsheet output has changed

Due to enhancements to the scorer, some of the columns in the output spreadsheets have been renamed or moved, and others have a slightly different interpretation. Full details here.

Command-line options to MATWorkspaceEngine have changed

In previous releases, we deprecated, but retained, the "operate" operation in MATWorkspaceEngine. This operation has finally been removed in 2.0. If you had still been doing something like this:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory operate core modelbuild

you should now do this:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine mydirectory modelbuild core

See the workspace documentation for more details.
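
If you have scripts that still build "operate" command lines, the reordering shown above can be sketched in Python 3 as follows. This is an illustration only: the function name is invented, and the argument order is inferred from the single example above, so check it against the workspace documentation before relying on it.

```python
def drop_operate(args):
    """Rewrite '<workspace> operate <folder> <operation> ...' for MAT 2.0.

    args is the argument list passed to MATWorkspaceEngine, without the
    executable name. Lists that do not use "operate" are returned unchanged.
    """
    if len(args) >= 4 and args[1] == "operate":
        workspace, _, folder, operation = args[:4]
        return [workspace, operation, folder] + list(args[4:])
    return list(args)
```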

Upgrading from version 1.2 to version 1.3

Web server security has been improved

We now provide a separate document on Web server security as it pertains to workspace access. There are a number of new options to MATWeb to support improved security. The most visible effect is that you can restrict access to workspaces from the MAT UI by using the --workspace_container_directory option when you start up the MAT Web server.

Attribute to control right-to-left text display has moved

In version 1.2, the text_right_to_left attribute lived on the workflow element in task.xml; we anticipated that different workflows might be used for different languages within the same task. Since then, we've realized that the task is going to be the appropriate level of encapsulation for language differences for the foreseeable future. Furthermore, the current implementation of right-to-left encoding did not work appropriately with workspaces. Accordingly, we've moved this attribute to the web_customization element, and it is now global to tasks.

Corpus size iteration has changed

The experiment engine has now been extended with general-purpose iterators for sets of values and for value increments. So it's now possible, for instance, to vary the number of model iterations from 20 to 100 by increments of 10 without having to write a separate model set specification for each possible value. These iterators can be combined, in which case you'll get the cross-product of the possible value settings, or you can define your own iterators to get more sophisticated behavior (e.g., iterating over pairs of attribute-value sets). For the user, this means that a couple of attributes have been removed from the experiment engine, and a new set of elements and attributes has been added.

In version 1.2, all you could iterate on was corpus size. The mechanism for this iteration has now changed. In version 1.2, this is what you'd do:

  [...]
<model_sets dir="model_sets">
<build_settings training_increment="4"
truncate_to_increment="yes"/>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]

In version 1.3, it looks like this instead:

  [...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]

You can see that the size processing has been removed from the <build_settings> and added to a new <corpus_settings> element, which contains an instance of the new <iterator> element to specify the type of the iteration. See the documentation and examples for the experiment engine for more details. Note that in version 1.2, you had to specify explicitly that the iteration ends on an increment exactly; in 1.3 this is the default, and to force the final corpus size to be used, you'll need the force_last attribute:

  [...]
<model_sets dir="model_sets">
<corpus_settings>
<iterator type="corpus_size" increment="4" force_last="yes"/>
</corpus_settings>
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
[...]
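
The conversion described in this section can be sketched with Python 3's standard xml.etree.ElementTree module. This is not a MAT tool, and the function name is invented for illustration. It also assumes, based only on the description above, that a 1.2 file without truncate_to_increment="yes" corresponds to force_last="yes" in 1.3.

```python
import xml.etree.ElementTree as ET


def migrate_corpus_size(model_sets):
    """Rewrite a parsed 1.2 <model_sets> element in place to the 1.3 form."""
    build = model_sets.find("build_settings")
    if build is None or "training_increment" not in build.attrib:
        return model_sets
    increment = build.attrib.pop("training_increment")
    truncate = build.attrib.pop("truncate_to_increment", "no")
    corpus_settings = ET.Element("corpus_settings")
    iterator = ET.SubElement(corpus_settings, "iterator",
                             {"type": "corpus_size", "increment": increment})
    # Assumption: without explicit truncation, 1.2 used the full final corpus.
    if truncate != "yes":
        iterator.set("force_last", "yes")
    # Put <corpus_settings> where <build_settings> was.
    model_sets.insert(list(model_sets).index(build), corpus_settings)
    # Keep <build_settings> only if it still carries other settings.
    if not build.attrib and len(build) == 0:
        model_sets.remove(build)
    return model_sets
```

Parse the experiment file with ET.parse, call the function on the <model_sets> element, and write the tree back out. Keep a copy of the original file, since comments and formatting are not preserved.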

Experiment spreadsheet columns have been expanded

The experiment engine output spreadsheets have been slightly expanded to include information about the run and model "families" in addition to the actual run and model. This change follows from the introduction of general iterators described above. See the documentation on MATExperimentEngine for details.

Experiment directory structure has changed

In order to support the iterators in the experiment engine, we've reorganized the structure of the experiment directory somewhat. See the documentation on MATExperimentEngine for details.

Upgrading from version 1.1 to version 1.2

New native Windows port

It is now possible to run MAT in Windows without Cygwin installed.

Single distribution bundle for all platforms

Unlike previous versions, there is a single distribution bundle for MAT 1.2 for all supported platforms. For compatibility with Windows, this bundle is now a zip file.

New tabbed terminal for Windows

If you use mat_controller.sh or mat_controller.bat under Windows, you'll find that there's a new tabbed terminal tool we're using, which has the advantage of not requiring Cygwin.

New version of Terminator.app for MacOS X 10.6

If you're using mat_controller.sh under MacOS X, and you intend to install 10.6, note that the previous version of Terminator.app, which supports the tabbed terminal behavior in mat_controller.sh, will not work in 10.6; you must install the newer version provided with MAT 1.2.

Tokenizer has changed

In version 1.2, the original OCaml tokenizer and Carafe trainer/tagger have been replaced by the Java reimplementations. There are a number of important changes that are required as a result. Among other things, the Java tokenizer produces slightly different token boundaries than the original OCaml tokenizer. This is problematic because the entire basis of most annotation systems, including MAT, is the subdivision into words (tokens). In order to have optimal performance, the tokenization of documents which are to be automatically tagged should match the tokenization of the documents which were used to create the tagger model. This means that in order to migrate from version 1.1 to version 1.2, among other things, you must retokenize your documents and update any references to the OCaml tokenizer.

First, to retokenize your documents, we've provided the new MATRetokenize utility. Please back up your data before you run this utility.

Next, if you refer to a tokenization step implementation in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTokenizationStep to MAT.JavaCarafe.CarafeTokenizationStep. You may also need to specify the heap_size attribute on the relevant tokenization <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation).

Trainer/tagger has changed

In version 1.2, the original OCaml tokenizer and Carafe trainer/tagger have been replaced by the Java reimplementations. There are a number of important changes that are required as a result. Among other things, the model format for the Java engine is completely different than the model format for the original OCaml trainer/tagger. This means that you must rebuild all your models, and update any references to the OCaml trainer/tagger.

First, retokenize your documents using MATRetokenize, as described above, and update your tokenization steps.

Next, update your tagger and trainer settings in task.xml according to the documentation provided for the Carafe engine.

Next, if you refer to a tagging step in your task.xml file, you must change all occurrences of MAT.PluginMgr.CarafeTagStep to MAT.JavaCarafe.CarafeTagStep. You may also need to specify the heap_size attribute on the relevant tag <step> in any workflow, if it turns out that the default Java heap size isn't large enough for your purposes (this attribute can also be specified on the command line; see the Carafe engine documentation). Similarly, if you have a <model_build_settings> entry, you must change all occurrences of MAT.CarafeModelBuilder.CarafeModelBuilder to MAT.JavaCarafe.CarafeModelBuilder, and possibly specify the heap_size attribute as well. (Note below that you must also change the syntax of <model_build_settings>.)
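
The three class renames named in this section and the tokenizer section can be applied to the text of a task.xml file in one pass. The Python 3 sketch below is only an illustration, with an invented function name; it does plain text substitution and does not add heap_size or change any other settings.

```python
# Old class name -> new class name, as listed in the text.
CLASS_RENAMES = {
    "MAT.PluginMgr.CarafeTokenizationStep": "MAT.JavaCarafe.CarafeTokenizationStep",
    "MAT.PluginMgr.CarafeTagStep": "MAT.JavaCarafe.CarafeTagStep",
    "MAT.CarafeModelBuilder.CarafeModelBuilder": "MAT.JavaCarafe.CarafeModelBuilder",
}


def rename_carafe_classes(task_xml_text):
    """Return task_xml_text with the OCaml-era class names replaced."""
    for old, new in CLASS_RENAMES.items():
        task_xml_text = task_xml_text.replace(old, new)
    return task_xml_text
```

Read the file, pass its contents through the function, and write the result to a new file so that the original is kept as a backup.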

Note that for the tagger, the prior_adjst attribute has been renamed to prior_adjust. For the trainer, the engine attribute has been eliminated, and the feature_set attribute as well; there's now a new feature_spec attribute which refers to a file in which you can describe your feature set, if you don't want to use the default feature set. Also, the psa_iterations flag has been removed, because the Carafe trainer now has more numerous options:

psa_iterations="6"

becomes

training_method="psa" max_iterations="6"

Because PSA no longer requires random segments, the no_random_psa_segments flag has been removed.
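
Taken together, the trainer changes above amount to a small attribute mapping, sketched here in Python 3. The function name is invented for illustration. The sketch simply drops feature_set, so if you relied on a non-default feature set you would still need to write a feature_spec file yourself.

```python
def migrate_trainer_settings(attrs):
    """Map 1.1 trainer attributes (a dict) to their 1.2 equivalents."""
    new = dict(attrs)
    # These attributes no longer exist in 1.2.
    for removed in ("engine", "feature_set", "no_random_psa_segments"):
        new.pop(removed, None)
    # psa_iterations="N" becomes training_method="psa" max_iterations="N".
    if "psa_iterations" in new:
        new["training_method"] = "psa"
        new["max_iterations"] = new.pop("psa_iterations")
    return new
```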

Finally, use the same tools as before to build your models: either MATModelBuilder in file mode, or the modelbuild operation in workspace mode.

Internals of experiment directories have changed

In order to support a more flexible way of specifying partitions in experiments, the way the configuration of experiments is cached has changed in version 1.2. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.1 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of specifying partitions in experiments, we've changed the way partitions are specified in the experiment XML files. We compare the relevant files below:

Version 1.1:

<experiment task='Named Entity'>
<corpora dir="corpora">
<partition split_fraction=".2" ctype="split"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test" corpus="test"/>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test" corpus="test"/>
</runs>
</experiment>

Version 1.2:

<experiment task='Named Entity'>
<corpora dir="corpora">
<partition name="train" fraction=".8"/>
<partition name="test" fraction=".2"/>
<corpus name="test">
<pattern>*.json</pattern>
</corpus>
</corpora>
<model_sets dir="model_sets">
<model_set name="test">
<training_corpus corpus="test" partition="train"/>
</model_set>
</model_sets>
<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
<run name="test" model="test">
<test_corpus corpus="test" partition="test"/>
</run>
</runs>
</experiment>

Note the following changes:
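
The differences between the two files shown above can be sketched as a small conversion in Python 3, using the standard xml.etree.ElementTree module. This is an illustration, not a MAT tool, and the function name is invented. It assumes that the 1.1 split_fraction value was the held-out (test) share, which is inferred only from comparing the two examples.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def migrate_experiment_1_1(root):
    """Rewrite a parsed 1.1 <experiment> element in place to the 1.2 layout."""
    corpora = root.find("corpora")
    old = corpora.find("partition") if corpora is not None else None
    if old is not None and "split_fraction" in old.attrib:
        # Assumption: split_fraction was the held-out (test) share.
        test_fraction = Decimal(old.get("split_fraction"))
        position = list(corpora).index(old)
        corpora.remove(old)
        # One unnamed split becomes two named partitions.
        corpora.insert(position, ET.Element(
            "partition", {"name": "train", "fraction": str(1 - test_fraction)}))
        corpora.insert(position + 1, ET.Element(
            "partition", {"name": "test", "fraction": str(test_fraction)}))
    # The corpus attribute becomes a child naming both corpus and partition.
    for model_set in list(root.iter("model_set")):
        corpus = model_set.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(model_set, "training_corpus",
                          {"corpus": corpus, "partition": "train"})
    for run in list(root.iter("run")):
        corpus = run.attrib.pop("corpus", None)
        if corpus is not None:
            ET.SubElement(run, "test_corpus",
                          {"corpus": corpus, "partition": "test"})
    return root
```

Fractions are written as "0.8" and "0.2" rather than ".8" and ".2".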

Settings in task.xml have changed

In order to clarify how task settings are handled in MAT, a number of changes have been made to the task.xml file syntax.

First, the <step> element of <step_implementations> no longer accepts arbitrary attributes. If you made use of this feature to pass settings to the initialization methods of workflow steps, you must now use the <create_settings> child element. We doubt that anyone has made use of this feature.

Second, the <step> element of <workflow> no longer accepts arbitrary attributes. If you make use of this feature to pass settings to workflow steps, you must now use the <create_settings>, <ui_settings>, or <run_settings> child elements. The most likely situation where this might arise is in passing defaults to the run methods of steps. For instance, if you used this feature to increase the Java heap size for Java Carafe, your task.xml file would have to be revised as follows.

Version 1.1:

  ...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag" heap_size="2G"/>
</workflow>
...
</workflows>
...

Version 1.2:

  ...
<workflows>
<workflow name="Demo" hand_annotation_available_at_end="yes">
<step name="zone"/>
<step name="tokenize"/>
<step name="tag">
<run_settings heap_size="2G"/>
</step>
</workflow>
...
</workflows>
...
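
For a task.xml file with many steps, the change shown above can be sketched in Python 3 with the standard xml.etree.ElementTree module. This is an illustration with an invented function name. It moves every attribute other than name into <run_settings>, which is an assumption: settings that belong in <create_settings> or <ui_settings> would have to be sorted out by hand.

```python
import xml.etree.ElementTree as ET


def move_step_attributes(workflows):
    """Move extra <step> attributes into a <run_settings> child, in place.

    workflows is a parsed <workflows> element.
    """
    for step in workflows.findall("workflow/step"):
        extra = dict((k, v) for k, v in step.attrib.items() if k != "name")
        if extra:
            for key in extra:
                del step.attrib[key]
            # Assumption: every extra attribute was a run-time setting.
            ET.SubElement(step, "run_settings", extra)
    return workflows
```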

Third, the way settings are specified for model configurations has changed. The name and class for the configuration are now separated from the settings which are passed to the model builder, as follows.

Version 1.1:

  ...
<model_build_settings class="MAT.JavaCarafe.CarafeModelBuilder"
training_method="psa" max_iterations="6"/>
...

Version 1.2:

  ...
<model_config class="MAT.JavaCarafe.CarafeModelBuilder">
<build_settings training_method="psa" max_iterations="6"/>
</model_config>
...
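
The same restructuring can be sketched in Python 3 with xml.etree.ElementTree. The function name is invented for illustration. It assumes that the name and class attributes stay on <model_config> and that everything else is a build setting.

```python
import xml.etree.ElementTree as ET


def to_model_config(old):
    """Build a 1.2 <model_config> from a parsed 1.1 <model_build_settings>."""
    settings = dict(old.attrib)
    identity = {}
    # name and class identify the configuration; the rest are build settings.
    for key in ("name", "class"):
        if key in settings:
            identity[key] = settings.pop(key)
    config = ET.Element("model_config", identity)
    ET.SubElement(config, "build_settings", settings)
    return config
```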

Finally, the <workflow> element no longer accepts arbitrary settings; these settings must be passed using the <ui_settings> child element. No task appears to use this option yet, so this shouldn't affect anyone.

Upgrading from version 1.0 to version 1.1

Internals of experiment directories have changed

In order to support a more flexible way of invoking the MAT engine in experiments, the way the configuration of experiments is cached has changed in version 1.1. What this means is that you will not be able to invoke MATExperimentEngine on experiment directories created using version 1.0 to regenerate the experiment scores.

Experiment XML files have changed

In order to support a more flexible way of invoking the MAT engine in experiments, we've changed the way corpus preprocessing and test run processing are specified. In version 1.0, the MAT engine was called as a command-line tool, and the options were specified as a command line; in version 1.1, the options are specified as XML attribute-value pairs. We compare the relevant experiment XML blocks below:

Version 1.0:

  <corpora dir="corpora">
<prep>--input_file_type xml-inline --workflow Align --steps 'zone,tokenize,align'</prep>
[...]
</corpora>

<runs dir="runs">
<run_settings>
<args>--steps zone,tokenize,tag --workflow Demo</args>
</run_settings>
[...]
</runs>

Version 1.1:

  <corpora dir="corpora">
<prep input_file_type="xml-inline" workflow="Align" steps="zone,tokenize,align"/>
[...]
</corpora>

<runs dir="runs">
<run_settings>
<args steps="zone,tokenize,tag" workflow="Demo"/>
</run_settings>
[...]
</runs>
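
If you have many experiment files to convert, the command-line strings can be turned into attribute-value pairs with Python 3's standard shlex module. The sketch below uses an invented function name and handles only options that take exactly one value, which is the case in the examples above.

```python
import shlex


def options_to_attributes(command_line):
    """Turn '--name value ...' into a {name: value} dictionary."""
    tokens = shlex.split(command_line)  # honors shell-style quoting
    attributes = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if not name.startswith("--"):
            raise ValueError("expected an option, got %r" % name)
        if i + 1 >= len(tokens) or tokens[i + 1].startswith("--"):
            raise ValueError("option %s has no value" % name)
        attributes[name[2:]] = tokens[i + 1]
        i += 2
    return attributes
```

The resulting dictionary can be used directly as the attributes of the new <prep> or <args> element.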

New training engine configuration in task.xml

Version 1.1 adds the ability to define different training engines. Because of this change, if you've defined your own task and you specified model build settings in your task.xml file, you must add a class attribute to the model_build_settings element. This attribute is not optional, and there is no default. If you're using the default Carafe engine, the value you should use for this attribute is MAT.CarafeModelBuilder.CarafeModelBuilder, as in the following example:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>

New folder in workspaces

Version 1.1 adds the ability to import MAT JSON documents into your workspaces which haven't yet been processed (as well as other annotation formats, like XML inline). Because of this change, if you have a workspace, you must add a directory to it. This directory is expected by the MAT workspace engine. For each workspace directory, do this:

% mkdir <workspace_dir>/folders/rich_incoming
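
If you have several workspaces, the same step can be scripted. The Python 3 sketch below uses an invented function name. It refuses to touch a directory that has no "folders" subdirectory, on the assumption that such a directory is not a workspace.

```python
import os


def add_rich_incoming(workspace_dirs):
    """Create folders/rich_incoming in each workspace; return the new paths."""
    created = []
    for workspace in workspace_dirs:
        folders = os.path.join(workspace, "folders")
        if not os.path.isdir(folders):
            raise ValueError("%s does not look like a workspace" % workspace)
        target = os.path.join(folders, "rich_incoming")
        # Skip workspaces that already have the directory.
        if not os.path.isdir(target):
            os.mkdir(target)
            created.append(target)
    return created
```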

New command line option restriction for MATModelBuilder

In version 1.1, it's possible to have multiple model build configurations in your task.xml file. In order to ensure that the correct configuration adds the appropriate command line options to the MATModelBuilder executable, it was necessary to introduce a new restriction on the --task option for MATModelBuilder: if it appears, it must now be the first command-line option. In other words, the following will now raise an error:

% $MAT_PKG_HOME/bin/MATModelBuilder \
--input_files '/path/to/my/docs/1[0-9][0-9].json' \
--input_dir /path/to/my/other/docs --task "Named Entity" \
--lexicon_dir /path/to/my/lexicon/ --save_as_default_model
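
For scripts that assemble MATModelBuilder argument lists, the restriction can be satisfied by moving --task and its value to the front. The following Python 3 sketch uses an invented function name and is not part of MAT.

```python
def move_task_first(args):
    """Return args with '--task <name>' moved to the front, if present."""
    args = list(args)
    if "--task" not in args:
        return args
    position = args.index("--task")
    if position + 1 >= len(args):
        raise ValueError("--task requires a value")
    task = args[position:position + 2]
    return task + args[:position] + args[position + 2:]
```

Applied to the failing example above, this yields an argument list that begins with --task "Named Entity", followed by the remaining options in their original order.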

Change to how the default model is specified in task.xml

In version 1.0, the default model was defined within the model build settings. In version 1.1, because of the presence of multiple model build configurations, we've separated the specification of the default model in task.xml.

Version 1.0:

  <model_build_settings engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6" default_model="default_model"/>

Version 1.1:

  <model_build_settings class="MAT.CarafeModelBuilder.CarafeModelBuilder" 
engine="anonTrain.native" feature_set="ANON-1"
psa_iterations="6"/>
<default_model>default_model</default_model>
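
Both 1.1 changes to <model_build_settings> (the required class attribute and the separate default model) can be sketched in Python 3 with xml.etree.ElementTree. The function name is invented for illustration. The class value is filled in only when it is missing, on the assumption that the task uses the default Carafe engine.

```python
import xml.etree.ElementTree as ET

DEFAULT_CARAFE_CLASS = "MAT.CarafeModelBuilder.CarafeModelBuilder"


def split_default_model(settings):
    """Update a parsed 1.0 <model_build_settings> element for 1.1.

    Returns (settings, default_model), where default_model is a new
    <default_model> element, or None if no default model was declared.
    """
    # Assumption: tasks without a class attribute use the default engine.
    if settings.get("class") is None:
        settings.set("class", DEFAULT_CARAFE_CLASS)
    name = settings.attrib.pop("default_model", None)
    if name is None:
        return settings, None
    default_model = ET.Element("default_model")
    default_model.text = name
    return settings, default_model
```

The returned <default_model> element should be added next to <model_build_settings> in the task, as in the 1.1 example above.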