The default jCarafe engine

While it's possible to use your own training and tagging engine in MAT, MAT provides jCarafe, a CRF-based sequence tagger, as a default training and tagging engine for simple spans (i.e., span plus true or effective labels). The flags for configuring jCarafe are available in a number of locations in MAT.

While jCarafe only supports simple span tagging, it is not limited to tasks which consist only of such labels. When presented with a more complex task, jCarafe will build models for, and insert, the true or effective labels in the task (i.e., it will ignore all spanless annotations, and ignore all attributes other than those related to effective labels).

jCarafe also provides an English tokenizer.

The jCarafe documentation

MAT is distributed with the jCarafe documentation, which is located here if you're viewing this documentation via a Web server, or in src/jcarafe.../resources if you've received MAT as a zip file.

The tokenizer

The jCarafe tokenizer engine wrapper is MAT.JavaCarafe.CarafeTokenizationStep. This class should be referenced in the <step_config> element of an <engine> in your task.xml file.

Options

Command line option
XML attribute
Value
Description
--heap_size <s>
heap_size
a heap size for the Java VM
The jCarafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>
stack_size
a stack size for the Java VM
The jCarafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--handle_tags
handle_tags
"yes" (XML)
If present, treat the signal as XML and tokenize XML elements and entities as single tokens.
--tokenizer_patterns
tokenizer_patterns
a string
See the jCarafe docs on --tokenizer-patterns.
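The mapping from heap_size and stack_size to Java VM arguments described above can be sketched as follows. This is a minimal illustration, not MAT code: the function name jvm_memory_args and the validation pattern are assumptions, and only the -Xmx and -Xss argument names come from the option descriptions.

```python
import re

# Hypothetical helper, not part of MAT. It mirrors the documented mapping:
# heap_size is passed via -Xmx and stack_size via -Xss.
# Assumption: values look like 512M, 2G or 4096k (digits plus optional unit).
_SIZE_RE = re.compile(r"^[1-9][0-9]*[kKmMgG]?$")


def jvm_memory_args(heap_size=None, stack_size=None):
    """Return the Java VM arguments implied by the two size settings."""
    args = []
    for flag, value in (("-Xmx", heap_size), ("-Xss", stack_size)):
        if value is None:
            continue  # setting not given, so the VM default applies
        if not _SIZE_RE.match(value):
            raise ValueError("unexpected JVM size value: %r" % value)
        args.append(flag + value)
    return args


print(jvm_memory_args(heap_size="2G", stack_size="512k"))
```

Rejecting malformed values up front is a design choice of this sketch; it reports a bad value before any Java subprocess would be started.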

The tagging engine

The jCarafe tagging engine wrapper is MAT.JavaCarafe.CarafeTagStep. This class should be referenced in the <step_config> element of an <engine> in your task.xml file.

The jCarafe tagger is sensitive to SEGMENTs. It will only insert annotations into SEGMENTs for the current step whose annotator = "MACHINE" or annotator = null. If no segments are found, it will insert annotations into all the zones; if no zones are found, it will insert annotations into the entire document. Any SEGMENT into which annotations are inserted will be marked annotator = "MACHINE".
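The fallback order described above (eligible SEGMENTs, then zones, then the whole document) can be sketched as follows. The dictionary layout with "start", "end" and "annotator" keys and the function name tagging_scope are hypothetical, chosen only for illustration; MAT's real document model is different. The sketch also assumes that when SEGMENTs exist but none is eligible, nothing is tagged.

```python
def tagging_scope(segments, zones, doc_length):
    """Return the (start, end) regions the tagger may annotate.

    segments and zones are lists of dicts with "start" and "end" keys;
    segments may also carry an "annotator" key (hypothetical layout).
    """
    if segments:
        # Only segments owned by the machine, or by nobody, are eligible.
        eligible = [s for s in segments
                    if s.get("annotator") in (None, "MACHINE")]
        for seg in eligible:
            # Simplification: every eligible segment is marked, whether or
            # not the tagger actually finds anything to insert into it.
            seg["annotator"] = "MACHINE"
        return [(s["start"], s["end"]) for s in eligible]
    if zones:
        return [(z["start"], z["end"]) for z in zones]
    # No segments and no zones: the entire document is in scope.
    return [(0, doc_length)]


segments = [
    {"start": 0, "end": 40, "annotator": "jane"},
    {"start": 40, "end": 90, "annotator": None},
]
print(tagging_scope(segments, [], 120))  # the human-owned segment is skipped
```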

Note that because the jCarafe tagger engine can be used in multiple steps, the names of the jCarafe tagger options and attributes must receive a prefix in any call to the processing engine. This prefix is the name of the step (not the pretty name) which uses jCarafe as its engine. The step name is preprocessed by replacing each sequence of non-alphanumeric characters with a single underscore. These prefixes must be used in the following contexts:

These prefixes must not be used in the following contexts:

Options

Command line option
XML attribute
Value
Description
--<step_name>_local
<step_name>_local
"yes" (XML)
By default, the MAT engine will contact the MAT Web server to tag a document, because the Web server has the capability of starting up and monitoring a long-living tagger task. The reason this is beneficial is that the jCarafe tagger, like many model-based taggers, has a fairly expensive startup cost. To block the engine from contacting the Web server, and force it to start up and shut down the tagger on its own, specify <step_name>_local="yes".
--<step_name>_model <model>
<step_name>_model
a string, a filename of a jCarafe model
If the task does not have a default model, the user must specify the location of the tagger model.
--<step_name>_prior_adjust
<step_name>_prior_adjust
a float
This setting biases the jCarafe tagger toward precision (positive values) or toward recall (negative values). The default is -1.0 (a slight recall bias). The practical range of values is usually -6.0 to 6.0.
--heap_size <s>
heap_size
a heap size for the Java VM
The jCarafe tagger is a Java application, and the default heap size may not be adequate for your model. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>
stack_size
a stack size for the Java VM
The jCarafe tagger is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--<step_name>_tagging_pre_models <s>
<step_name>_tagging_pre_models
a string
If present, a comma-separated list of glob-style patterns specifying the models to include as pre-taggers. This is an advanced feature that normal users will not be using.
--<step_name>_add_tokens_internally
<step_name>_add_tokens_internally
"yes" (XML)
If present, jCarafe will use its internal tokenizer to tokenize the document before tagging. If your workflow doesn't tokenize the document, you must provide this flag, or jCarafe will have no tokens to base its tagging on. We recommend strongly that you tokenize your documents separately; you should not use this flag.
--<step_name>_capture_token_confidences
<step_name>_capture_token_confidences
"yes" (XML)
If present, jCarafe will capture token confidence metrics for later exploitation.
--<step_name>_capture_sequence_confidences
<step_name>_capture_sequence_confidences
"yes" (XML)
If present, jCarafe will capture sequence confidence metrics for later exploitation.
--<step_name>_parallel
<step_name>_parallel
"yes" (XML)
If present, parallelizes the decoding.
--<step_name>_nthreads
<step_name>_nthreads
an integer
If --<step_name>_parallel is used, controls the number of threads used for decoding.
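The step-name prefix used by most of the options in this table can be sketched as follows. The function names are hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits, which this documentation does not state. Note that heap_size and stack_size are listed above without a prefix, so the sketch should not be applied to them.

```python
import re


def option_prefix(step_name):
    """Collapse each run of non-alphanumeric characters into one underscore."""
    # Assumption: "alphanumeric" means ASCII letters and digits only.
    return re.sub(r"[^A-Za-z0-9]+", "_", step_name)


def prefixed_option(step_name, option):
    """Build the command-line form of a prefixed tagger option."""
    return "--%s_%s" % (option_prefix(step_name), option)


print(prefixed_option("tag", "model"))
print(prefixed_option("pre-tag.v2", "prior_adjust"))
```

The same prefix without the leading dashes gives the XML attribute name, as in the second column of the table.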

The training engine

The jCarafe training engine class is MAT.JavaCarafe.CarafeModelBuilder. You should reference this class in the <model_config> element of the <engine> in your task.xml file which references the jCarafe tagging engine.

The jCarafe trainer is sensitive to SEGMENTs. It will train on all SEGMENTs for the relevant step which have been touched by a human annotator, whether or not they're gold (see, however, the --partial_training_on_gold_only option). If no SEGMENTs are found, it will use all the zones; if no zones are found, it will use the entire document. The scope of the training is important; any blank regions are treated as implicitly negative information (i.e., the trainer will conclude that there's no annotation there on purpose).
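The training scope described above can be sketched in the same spirit. Everything about the data layout here is an assumption made for illustration: segments are dicts with "start", "end", "annotator" and "status" keys, a segment counts as touched by a human when its annotator is neither null nor "MACHINE", and the gold statuses are the "human gold" and "reconciled" values named in the description of --partial_training_on_gold_only.

```python
# Status values taken from the --partial_training_on_gold_only description.
GOLD_STATUSES = {"human gold", "reconciled"}


def training_scope(segments, zones, doc_length, gold_only=False):
    """Return the (start, end) regions the trainer would learn from."""
    if segments:
        chosen = []
        for seg in segments:
            # Assumption: "touched by a human" means a non-machine annotator.
            if seg.get("annotator") in (None, "MACHINE"):
                continue
            if gold_only and seg.get("status") not in GOLD_STATUSES:
                continue
            chosen.append((seg["start"], seg["end"]))
        return chosen
    if zones:
        return [(z["start"], z["end"]) for z in zones]
    # No segments and no zones: train on the entire document.
    return [(0, doc_length)]


segments = [
    {"start": 0, "end": 50, "annotator": "jane", "status": "human gold"},
    {"start": 50, "end": 80, "annotator": "raj", "status": "non-gold"},
    {"start": 80, "end": 120, "annotator": "MACHINE", "status": "non-gold"},
]
print(training_scope(segments, [], 120))
print(training_scope(segments, [], 120, gold_only=True))
```

Everything inside a returned region that carries no annotation is what the trainer will treat as negative evidence, which is why the choice of regions matters.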

Basic options

There are only a few settings here that you should change on any regular basis:

Command line option
XML attribute
Value
Description
--lexicon_dir <dir>
lexicon_dir
a pathname
If present, the name of a directory which contains a jCarafe training lexicon. This pathname should be an absolute pathname, and should have a trailing slash. The content of the directory should be a set of files, each of which contains a sequence of tokens, one per line. The name of the file will be used as a training feature for the token. You can use this feature, for instance, to provide implicit part-of-speech information (e.g., create a file named ADJ which contains a sequence of words that are adjectives) or name information (e.g., create a file named NAME which contains a sequence of tokens which can occur in proper names).

On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task.

Note that the interpretation of the contents of the lexicons depends on the jCarafe feature specification.
--heap_size <s>
heap_size
a heap size for the Java VM
The jCarafe trainer is a Java application, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>
stack_size
a stack size for the Java VM
The jCarafe trainer is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--parallel
parallel
"yes" (XML)
If present, parallelizes the training.
--nthreads
nthreads
an integer
If --parallel is used, controls the number of threads used for training.
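The directory layout expected by --lexicon_dir can be sketched as follows: one file per feature name, one token per line, and an absolute pathname with a trailing slash. The helper name write_lexicon, the sample words, and the choice of UTF-8 encoding are assumptions of this sketch, not documented MAT behavior.

```python
import os
import tempfile


def write_lexicon(lexicon_dir, entries):
    """Write one file per feature name, one token per line.

    entries maps a file name (used as the feature name, e.g. "ADJ")
    to a list of tokens. Returns the absolute directory path with a
    trailing separator, the form --lexicon_dir expects.
    """
    os.makedirs(lexicon_dir, exist_ok=True)
    for feature_name, tokens in entries.items():
        path = os.path.join(lexicon_dir, feature_name)
        # Assumption: UTF-8 is an acceptable encoding for lexicon files.
        with open(path, "w", encoding="utf-8") as fp:
            for token in tokens:
                fp.write(token + "\n")
    # Joining with "" appends the trailing separator.
    return os.path.join(os.path.abspath(lexicon_dir), "")


demo_dir = os.path.join(tempfile.mkdtemp(), "lexicon")
lexicon_path = write_lexicon(demo_dir, {
    "ADJ": ["red", "tall"],
    "NAME": ["Smith", "Jones"],
})
print(lexicon_path)  # pass this value to --lexicon_dir
```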

Advanced options

The options in this section are documented here for completeness. If you're not familiar with the jCarafe training engine and its implementation, the chances are that you'll never use any of these values. If you want to use them, someone who is knowledgeable about jCarafe should set these values for you in task.xml, and unless you really know what you're doing, you should not override them on the command line.

jCarafe provides the option of using non-standard training methods. One of those methods is called periodic stepsize adjustment (PSA). This method, when used correctly, is significantly faster than the normal training mechanism. However, it sometimes produces poorer results, under conditions which are not yet well understood. You might prefer to use it if you're doing comparative analysis of multiple models, or you're just starting off with a rough-and-ready system and you don't need to optimize for accuracy yet. The --max_iterations flag governs the number of training cycles; more is not necessarily better, because the engine may overfit to the data.

The documentation for the --feature_spec flag below refers to jCarafe feature spec files. The documentation for how to create these files can be found in the jCarafe documentation. Similarly, if you want details on the --gaussian_prior, --no_begin, --l1, and --l1_c flags, see the jCarafe documentation.

The documentation for --tags and --pre_models refers to an advanced feature of jCarafe where it can use tagging models to generate input features for multi-stage tagging. We will not discuss this advanced capability of jCarafe any further.

Command line option
XML attribute
Value
Description
--feature_spec <file>
feature_spec
a filename
Name of the file that contains the jCarafe feature specification. The default specification will be used if none is provided. If the filename is not an absolute filename, it will be interpreted relative to the directory of the task which is being trained for. (This is because this option is more likely to be provided in your task.xml file than on the command line.)

On the command line, optional if feature_spec is set in the <build_settings> for the relevant model config in the task.xml file for the task.
--training_method <meth>
training_method
"psa"
If present, specifies a training method other than the standard method. Currently, the only recognized value is psa. The psa method is noticeably faster, but may produce somewhat poorer results. You can use a value of '' to override a previously specified training method (e.g., a default method in your task).
--max_iterations <num>
max_iterations
an integer
Number of iterations for the training mechanism to use. Current defaults are 200 for standard training, 10 for PSA training. On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task.
--tags <s>
tags
a string
If present, a comma-separated list of tags to pass to the training engine instead of the full tag set for the task (used to create per-tag pre-tagging models for multi-stage training and tagging).
--pre_models <s>
pre_models
a string
If present, a comma-separated list of glob-style patterns specifying the models to include as pre-taggers.
--gaussian_prior <f>
gaussian_prior
a float
A positive float, default is 10.0. See the jCarafe docs for details.
--no_begin
no_begin
"yes" (XML)
Don't introduce begin states during training. Useful if you're certain that you won't have any adjacent spans with the same label. See the jCarafe documentation for more details.
--l1
l1
"yes" (XML)
Use L1 regularization for PSA training. See the jCarafe docs for details.
--l1_c <f>
l1_c
a float
Change the penalty factor for the L1 regularizer. See the jCarafe docs for details.
--add_tokens_internally
add_tokens_internally
"yes" (XML)
If present, jCarafe will use its internal tokenizer to tokenize the document before training. If your workflow doesn't tokenize the document, you must provide this flag, or jCarafe will have no tokens to base its training on. We recommend strongly that you tokenize your documents separately; you should not use this flag.
--partial_training_on_gold_only
partial_training_on_gold_only
"yes" (XML)
When jCarafe is presented with partially tagged documents, by default MAT will ask jCarafe to train on all annotated segments, gold or not. If this flag is specified, only "human gold" or "reconciled" segments will be used for training.
--word_properties
word_properties
a string
See the jCarafe docs on --word-properties.
--word_scores
word_scores
a string
See the jCarafe docs on --word-scores.
--learning_rate
learning_rate
a string
See the jCarafe docs on --learning-rate.
--disk_cache
disk_cache
a string
See the jCarafe docs on --disk-cache.
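The interaction between --training_method and --max_iterations, and the way command-line values override the build settings in task.xml, can be sketched as follows. The dictionaries standing in for the two sources of settings and the function name resolve_training_settings are hypothetical; only the defaults (200 for standard training, 10 for PSA) and the use of '' to cancel a previously specified method come from the table above.

```python
# Defaults quoted in the option table; "" stands for the standard method.
DEFAULT_MAX_ITERATIONS = {"": 200, "psa": 10}


def resolve_training_settings(build_settings, command_line):
    """Merge task.xml build settings with command-line values.

    Both arguments are plain dicts (a hypothetical representation);
    command-line values win when both are present.
    """
    merged = dict(build_settings)
    merged.update(command_line)
    # An empty string cancels a previously specified method.
    method = merged.get("training_method") or ""
    if method not in DEFAULT_MAX_ITERATIONS:
        raise ValueError("unrecognized training method: %r" % method)
    iterations = merged.get("max_iterations")
    if iterations is None:
        iterations = DEFAULT_MAX_ITERATIONS[method]
    return method, int(iterations)


# The task defaults to PSA; the command line switches back to standard training.
print(resolve_training_settings({"training_method": "psa"},
                                {"training_method": ""}))
```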

A note about feature specifications

jCarafe uses a declarative representation for describing the features it will use when training and tagging. The default feature specification file is found in resources/default.fspec in your jCarafe directory. The jCarafe docs describe the meaning of the contents of these files, and how to write your own if you so choose.

In some cases, the feature specification interacts with trainer parameters. For instance, the contents of the lexicon files provided by the --lexicon_dir option are interpreted according to features in the feature specification file, in particular the lexFn and downLexFn specs, as follows:

So, for example, if you're annotating names, and all the names appear with consistent capitalization in your documents, you can use "lexFn" and list the elements in your lexicon file using the appropriate capitalization. On the other hand, if the capitalization is inconsistent, then you should use the case-insensitive option.

The default feature spec file distributed with MAT contains a "lexFn" feature, for case-sensitive matching.
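The difference between case-sensitive and case-insensitive lexicon matching can be illustrated with a small sketch. This is not jCarafe's implementation: the function name and the in-memory lexicon are hypothetical, and the sketch assumes the case-insensitive variant lowercases both the token and the lexicon entries, which the jCarafe docs may define differently.

```python
def lexicon_features(token, lexicon, case_sensitive=True):
    """Return the names of the lexicon files whose entries match token.

    lexicon maps a lexicon file name (e.g. "NAME") to a set of entries.
    """
    matches = []
    for name, entries in lexicon.items():
        if case_sensitive:
            hit = token in entries
        else:
            # Assumption: both sides are lowercased before comparison.
            hit = token.lower() in {entry.lower() for entry in entries}
        if hit:
            matches.append(name)
    return sorted(matches)


lexicon = {"NAME": {"Smith", "Jones"}, "ADJ": {"red"}}
print(lexicon_features("Smith", lexicon))                        # exact case matches
print(lexicon_features("SMITH", lexicon))                        # no match when case matters
print(lexicon_features("SMITH", lexicon, case_sensitive=False))  # matches again
```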