The default jCarafe engine

Although you can use your own training and tagging engine in MAT, MAT provides jCarafe, which combines a CRF-based sequence tagger with a maximum-entropy engine, as its default training and tagging engine. The jCarafe sequence tagger provides training and tagging for simple spans (i.e., span plus true or effective labels), while the maximum-entropy engine provides training and tagging for attributes with fixed values (e.g., choice attributes or boolean attributes). jCarafe also provides an English tokenizer and a simple sentence tagger.

Where to customize jCarafe
How jCarafe limits the capabilities of MAT
The tokenizer and sentence tagger
The CRF tagging engine
The CRF training engine
The maximum-entropy tagging engine
The maximum-entropy training engine

Where to customize jCarafe

The flags for configuring jCarafe are available in a number of locations in MAT.

The jCarafe training engines can be configured using the <build_settings> element in your task.xml file.
The jCarafe training engines can be customized further using the <build_settings> element in your experiment XML file, or the <settings> element for the "modelbuild" operation in the <workspace> section of your task.xml file.
The jCarafe training engines can be customized by passing flags to the MATModelBuilder command-line tool.
The jCarafe tagging engines can be customized by passing flags to the MATEngine command-line tool, and to other places in the system where MATEngine is implicitly called, such as the <args> element in the experiment XML file.
The jCarafe tagging engines can be customized via the <run_settings> element in the <step> element in the <workflow> element in your task.xml file.
The jCarafe tagging engines can be customized by using the <settings> element for the "advance" operation in the <workspace> element of your task.xml file.

How jCarafe limits the capabilities of MAT

While the jCarafe engine only supports training and tagging for simple spans and fixed-value attributes, it can still be used in tasks which contain other kinds of annotations. The more complex annotations (e.g., relations) must be added either by hand tagging or by an engine other than jCarafe, which you can provide, as noted above. However, our expectation is that, in general, most installations will have only jCarafe available.

For instance, the jCarafe CRF engine can be used in a workflow step which adds an annotation set whose annotation labels have arbitrary attributes associated with them. However, the models that it builds will only learn the simple spans, and only the simple spans will be added when the models are applied. The other attributes will have to be added by hand, or by another workflow step (perhaps a jCarafe maximum-entropy classifier step, if the attribute is fixed-value). Your task may therefore be very complex, while jCarafe automatically processes only a subset of it. In that case, you wouldn't be able to support a tag-a-little, learn-a-little loop, or the experiment harness, for the full task.

Here's a complete list of places where jCarafe alone limits MAT in the types of annotations it can deal with:

Hand annotation: no limitations.
Automated content tagging in workflows, workspaces, and the experiment engine: jCarafe will add only simple span annotations, or fixed-value attributes. jCarafe will not add additional attributes beyond these (if present), nor spanless annotations.
Model building in file mode, workspaces, and the experiment engine: jCarafe will build models only for simple span annotations, or fixed-value attributes. jCarafe will not build models for additional attributes beyond these (if present), nor spanless annotations.
UI document comparison: no limitations.
UI document alignment: simple span annotations only.
UI document reconciliation: no limitations.
Scoring: no limitations.
Workspace import, assignment and folder management: no limitations.
Transducing and reporting: no limitations.

The jCarafe documentation

MAT is distributed with the jCarafe documentation, which is located here if you're viewing this documentation via a Web server, or in third_party/install/jcarafe.../resources if you've received MAT as a zip file and run the installation script.

The tokenizer and sentence tagger

The jCarafe tokenizer engine wrapper is MAT.JavaCarafe.CarafeTokenizationStep. This class should be referenced in the <step_config> element of an <engine> in your task.xml file.

Options

Command line option	XML attribute	Value	Description
--heap_size <s>	heap_size	a heap size for the Java VM	The jCarafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>	stack_size	a stack size for the Java VM	The jCarafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--handle_tags	handle_tags	"yes" (XML)	If present, treat the signal as XML and tokenize XML elements and entities as single tokens.
--tokenizer_patterns	tokenizer_patterns	a string	See the jCarafe docs on --tokenizer-patterns.
--sentence_label	sentence_label	a string	If present, preserve the sentence annotations coming out of the jCarafe tokenizer and map them to the specified label.
--jcarafe_args	jcarafe_args	a string	If present, a string containing arguments which will be passed directly to jCarafe. The string will be parsed using the Python shlex package, following POSIX command-line parsing rules as closely as possible. Note that all pathnames should be absolute.
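As a rough illustration of how the heap_size and stack_size values in the table above relate to the Java VM arguments -Xmx and -Xss, the following sketch builds the corresponding argument list. The function name java_vm_args and the validation pattern are assumptions made for this example; they are not part of MAT, and MAT's own handling may differ.

```python
import re

# Assumed shape of a size value: digits followed by an optional
# k/m/g unit, as in the documented examples 512M, 2G, 4096k and 512k.
_SIZE_RE = re.compile(r"^[0-9]+[kKmMgG]?$")


def java_vm_args(heap_size=None, stack_size=None):
    """Return the Java VM arguments for the given sizes (hypothetical helper)."""
    args = []
    for flag, value in (("-Xmx", heap_size), ("-Xss", stack_size)):
        if value is None:
            # Nothing specified: leave the Java VM default in place.
            continue
        if not _SIZE_RE.match(value):
            raise ValueError("unexpected size value: %r" % value)
        args.append(flag + value)
    return args


print(java_vm_args(heap_size="2G", stack_size="512k"))
# ['-Xmx2G', '-Xss512k']
```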

The CRF tagging engine

The jCarafe conditional random field tagging engine wrapper is MAT.JavaCarafe.CarafeTagStep. This class should be referenced in the <step_config> element of an <engine> in your task.xml file.

The jCarafe tagger is sensitive to SEGMENTs. It will only insert annotations into SEGMENTs for the current step whose annotator = "MACHINE" or annotator = null. If no segments are found, it will insert annotations into all the zones; if no zones are found, it will insert annotations into the entire document. Any SEGMENT into which annotations are inserted will be marked annotator = "MACHINE".
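The fallback order described above can be sketched as follows. This is only an illustration of the rule, not MAT's implementation: the function name, the tuple representation of SEGMENTs and zones, and the reading of "no segments are found" as "the document has no SEGMENTs at all" are assumptions made for this example.

```python
def taggable_regions(doc_length, segments, zones):
    """Return the (start, end) regions into which the tagger may insert.

    segments: list of (start, end, annotator) tuples for the current step
    zones:    list of (start, end) tuples
    """
    if segments:
        # Only SEGMENTs owned by the machine, or by nobody, are eligible.
        return [(start, end) for (start, end, annotator) in segments
                if annotator in (None, "MACHINE")]
    if zones:
        # No SEGMENTs at all: fall back to every zone.
        return list(zones)
    # No zones either: fall back to the entire document.
    return [(0, doc_length)]


# "jdoe" is a hypothetical human annotator name.
segments = [(0, 10, "MACHINE"), (10, 20, "jdoe"), (20, 30, None)]
print(taggable_regions(100, segments, [(0, 30)]))
# [(0, 10), (20, 30)]
```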

Note that because the jCarafe tagger engine can be used in multiple steps, in any call to the processing engine the names of the jCarafe tagger options and attributes must receive a prefix. This prefix is the name of the step (not the pretty name) which uses jCarafe as its engine. This step name is preprocessed by replacing all non-alphanumeric sequences by a single underscore. These prefixes must be used in the following contexts:

the command-line options passed to an invocation to MATEngine
the <args> and <prep_args> elements in the experiment XML file
the <settings> element of any workspace commands which invoke the engine
the attribute specified in an <iterator> element in the experiment XML file

These prefixes must not be used in the following contexts:

the <run_settings> element within the <step> element within workflows in task.xml
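The prefix rule above can be sketched as follows. The function names and the step names are hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits; MAT's own normalization may treat other characters differently.

```python
import re


def step_prefix(step_name):
    """Replace each run of non-alphanumeric characters with one underscore."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.sub(r"[^A-Za-z0-9]+", "_", step_name)


def prefixed_option(step_name, option):
    """Build the command-line form of a step-prefixed option name."""
    return "--%s_%s" % (step_prefix(step_name), option)


print(prefixed_option("tag", "model"))            # --tag_model
print(prefixed_option("pre-tag step", "local"))   # --pre_tag_step_local
```

The same prefixed name, without the leading dashes, would be used as the XML attribute in the contexts listed above.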

Options

Command line option	XML attribute	Value	Description
--<step_name>_local	<step_name>_local	"yes" (XML)	By default, the MAT engine will contact the MAT Web server to tag a document, because the Web server has the capability of starting up and monitoring a long-living tagger task. The reason this is beneficial is that the jCarafe tagger, like many model-based taggers, has a fairly expensive startup cost. To block the engine from contacting the Web server, and force it to start up and shut down the tagger on its own, specify <step_name>_local="yes".
--<step_name>_model <model>	<step_name>_model	a string, a filename of a jCarafe model	If the task does not have a default model, the user must specify the location of the tagger model.
--<step_name>_prior_adjust	<step_name>_prior_adjust	a float	The jCarafe tagger can be biased toward recall or toward precision. This setting biases the jCarafe tagger to favor precision (positive values) or recall (negative values). Default is -1.0 (slight recall bias). The practical range of values is usually between -6.0 and 6.0.
--heap_size <s>	heap_size	a heap size for the Java VM	The jCarafe tagger is a Java application, and the default heap size may not be adequate for your model. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>	stack_size	a stack size for the Java VM	The jCarafe tagger is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--<step_name>_tagging_pre_models <s>	<step_name>_tagging_pre_models	a string	If present, a comma-separated list of glob-style patterns specifying the models to include as pre-taggers. This is an advanced feature that normal users will not be using.
--<step_name>_add_tokens_internally	<step_name>_add_tokens_internally	"yes" (XML)	If present, jCarafe will use its internal tokenizer to tokenize the document before tagging. If your workflow doesn't tokenize the document, you must provide this flag, or jCarafe will have no tokens to base its tagging on. We recommend strongly that you tokenize your documents separately; you should not use this flag.
--<step_name>_capture_token_confidences	<step_name>_capture_token_confidences	"yes" (XML)	If present, jCarafe will capture token confidence metrics for later exploitation.
--<step_name>_capture_sequence_confidences	<step_name>_capture_sequence_confidences	"yes" (XML)	If present, jCarafe will capture sequence confidence metrics for later exploitation.
--<step_name>_parallel	<step_name>_parallel	"yes" (XML)	If present, parallelizes the decoding.
--<step_name>_nthreads	<step_name>_nthreads	an integer	If --<step_name>_parallel is used, controls the number of threads used for decoding.
--<step_name>_jcarafe_args	<step_name>_jcarafe_args	a string	If present, a string containing arguments which will be passed directly to jCarafe. The string will be parsed using the Python shlex package, following POSIX command-line parsing rules as closely as possible. Note that all pathnames should be absolute.

The CRF training engine

The jCarafe conditional random field training engine class is MAT.JavaCarafe.CarafeModelBuilder. You should reference this class in the <model_config> element of the <engine> in your task.xml file which references the jCarafe tagging engine.

The jCarafe trainer is sensitive to SEGMENTs. It will train on all SEGMENTs for the relevant step which have been touched by a human annotator, whether or not they're gold (see, however, the --partial_training_on_gold_only option). If no SEGMENTs are found, it will use all the zones; if no zones are found, it will use the entire document. The scope of the training is important; any blank regions are treated as implicitly negative information (i.e., the trainer will conclude that there's no annotation there on purpose).
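The training scope described above can be sketched as follows. This is an illustration only, not MAT's implementation. The dictionary representation of a SEGMENT, the "non-gold" status value, and the test for "touched by a human annotator" (an annotator which is neither empty nor "MACHINE") are assumptions made for this example; the "human gold" and "reconciled" values are taken from the description of the --partial_training_on_gold_only option.

```python
GOLD_STATUSES = ("human gold", "reconciled")


def training_regions(doc_length, segments, zones, gold_only=False):
    """Return the (start, end) regions the trainer would learn from."""
    if segments:
        regions = []
        for seg in segments:
            # Assumption: a human-touched SEGMENT has a non-machine annotator.
            if seg["annotator"] in (None, "MACHINE"):
                continue
            # With gold_only, skip human SEGMENTs which are not yet gold.
            if gold_only and seg["status"] not in GOLD_STATUSES:
                continue
            regions.append((seg["start"], seg["end"]))
        return regions
    if zones:
        return list(zones)
    return [(0, doc_length)]


# "jdoe" is a hypothetical annotator; "non-gold" is an assumed status value.
segments = [
    {"start": 0, "end": 10, "annotator": "jdoe", "status": "human gold"},
    {"start": 10, "end": 20, "annotator": "jdoe", "status": "non-gold"},
    {"start": 20, "end": 30, "annotator": "MACHINE", "status": "non-gold"},
]
print(training_regions(100, segments, []))                  # [(0, 10), (10, 20)]
print(training_regions(100, segments, [], gold_only=True))  # [(0, 10)]
```

Everything inside the returned regions which carries no annotation counts as negative evidence, which is why the choice of regions matters.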

Basic options

There are only a few settings here that you should change on any regular basis:

Command line option	XML attribute	Value	Description
--lexicon_dir <dir>	lexicon_dir	a pathname	If present, the name of a directory which contains a jCarafe training lexicon. This pathname should be an absolute pathname, and should have a trailing slash. The content of the directory should be a set of files, each of which contains a sequence of tokens, one per line. The name of the file will be used as a training feature for the token. You can use this feature, for instance, to provide implicit part-of-speech information (e.g., create a file named ADJ which contains a sequence of words that are adjectives) or name information (e.g., create a file named NAME which contains a sequence of tokens which can occur in proper names). On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task. Note that the interpretation of the contents of the lexicons depends on the jCarafe feature specification.
--heap_size <s>	heap_size	a heap size for the Java VM	The jCarafe trainer is a Java application, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>	stack_size	a stack size for the Java VM	The jCarafe trainer is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--parallel	parallel	"yes" (XML)	If present, parallelizes the training.
--nthreads	nthreads	an integer	If --parallel is used, controls the number of threads used for training.
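The directory layout expected by --lexicon_dir can be sketched as follows: one file per feature name, one token per line, with the resulting pathname absolute and ending in a slash. The helper name write_lexicon_dir, the "lexicon" subdirectory name, the sample tokens, and the UTF-8 encoding are assumptions made for this example.

```python
import os
import tempfile


def write_lexicon_dir(parent_dir, lexicons):
    """Write one lexicon file per feature name and return the directory path.

    lexicons maps a feature name (used as the file name) to a list of tokens.
    """
    lex_dir = os.path.join(os.path.abspath(parent_dir), "lexicon")
    os.makedirs(lex_dir, exist_ok=True)
    for feature_name, tokens in lexicons.items():
        path = os.path.join(lex_dir, feature_name)
        with open(path, "w", encoding="utf-8") as fp:
            # One token per line.
            fp.write("\n".join(tokens) + "\n")
    # Joining with "" appends the trailing separator the option expects.
    return os.path.join(lex_dir, "")


lexicon_dir = write_lexicon_dir(
    tempfile.mkdtemp(),
    {"ADJ": ["red", "large"], "NAME": ["Alice", "Nakamura"]},
)
print(lexicon_dir)
```

The printed pathname is the value you would then supply as the lexicon_dir setting.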

Advanced options

The options in this section are documented here for completeness. If you're not familiar with the jCarafe training engine and its implementation, the chances are that you'll never use any of these values. If you want to use them, someone who is knowledgeable about jCarafe should set these values for you in task.xml, and unless you really know what you're doing, you should not override them on the command line.

jCarafe provides the option of using non-standard training methods. One of those methods is called periodic stepsize adjustment (PSA). This method, when used correctly, is significantly faster than the normal training mechanism. However, it sometimes performs less well, in circumstances which are not yet well understood. You might prefer to use it if you're doing comparative analysis of multiple models, or you're just starting off with a rough-and-ready system and you don't need to optimize for accuracy yet. The --max_iterations flag governs the number of training cycles; more is not necessarily better, because the engine may overfit to the data.

The documentation for the --feature_spec flag below refers to jCarafe feature spec files. The documentation for how to create these files can be found in the jCarafe documentation. Similarly, if you want details on the --gaussian_prior, --no_begin, --l1, and --l1_c flags, see the jCarafe documentation.

The documentation for --tags and --pre_models refers to an advanced feature of jCarafe where it can use tagging models to generate input features for multi-stage tagging. We will not discuss this advanced capability of jCarafe any further.

Command line option	XML attribute	Value	Description
--feature_spec <file>	feature_spec	a filename	Name of the file that contains the jCarafe feature specification. The default specification will be used if none is provided. If the filename is not an absolute filename, it will be interpreted relative to the directory of the task which is being trained for. (This is because this option is more likely to be provided in your task.xml file than on the command line.) On the command line, optional if feature_spec is set in the <build_settings> for the relevant model config in the task.xml file for the task.
--training_method <meth>	training_method	"psa"	If present, specify a training method other than the standard method. Currently, the only recognized value is psa. The psa method is noticeably faster, but may result in somewhat poorer results. You can use a value of '' to override a previously specified training method (e.g., a default method in your task).
--max_iterations <num>	max_iterations	an integer	Number of iterations for the training mechanism to use. Current defaults are 200 for standard training, 10 for PSA training. On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task.
--tags <s>	tags	a string	If present, a comma-separated list of tags to pass to the training engine instead of the full tag set for the task (used to create per-tag pre-tagging models for multi-stage training and tagging).
--pre_models <s>	pre_models	a string	If present, a comma-separated list of glob-style patterns specifying the models to include as pre-taggers.
--gaussian_prior <f>	gaussian_prior	a float	A positive float, default is 10.0. See the jCarafe docs for details.
--no_begin	no_begin	"yes" (XML)	Don't introduce begin states during training. Useful if you're certain that you won't have any adjacent spans with the same label. See the jCarafe documentation for more details.
--l1	l1	"yes" (XML)	Use L1 regularization for PSA training. See the jCarafe docs for details.
--l1_c <f>	l1_c	a float	Change the penalty factor for the L1 regularizer. See the jCarafe docs for details.
--add_tokens_internally	add_tokens_internally	"yes" (XML)	If present, jCarafe will use its internal tokenizer to tokenize the document before training. If your workflow doesn't tokenize the document, you must provide this flag, or jCarafe will have no tokens to base its training on. We recommend strongly that you tokenize your documents separately; you should not use this flag.
--partial_training_on_gold_only	partial_training_on_gold_only	"yes" (XML)	When jCarafe is presented with partially tagged documents, by default MAT will ask jCarafe to train on all annotated segments, gold or not. If this flag is specified, only "human gold" or "reconciled" segments will be used for training.
--word_properties	word_properties	a string	See the jCarafe docs on --word-properties.
--word_scores	word_scores	a string	See the jCarafe docs on --word-scores.
--learning_rate	learning_rate	a string	See the jCarafe docs on --learning-rate.
--disk_cache	disk_cache	a string	See the jCarafe docs on --disk-cache.
--jcarafe_args	jcarafe_args	a string	If present, a string containing arguments which will be passed directly to jCarafe. The string will be parsed using the Python shlex package, following POSIX command-line parsing rules as closely as possible. Note that all pathnames should be absolute.
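To see how a jcarafe_args string is split, the following sketch uses the Python shlex package in its default POSIX mode. The helper name parse_jcarafe_args, the sample pathname, and the assumption that --word-properties takes a pathname as its value are all illustrative; the absolute-path check is a simple heuristic added for this example, not something MAT is known to perform.

```python
import posixpath
import shlex


def parse_jcarafe_args(arg_string):
    """Split an argument string using POSIX command-line parsing rules."""
    tokens = shlex.split(arg_string)  # POSIX mode is the default
    for token in tokens:
        # Heuristic: anything containing a slash is treated as a pathname.
        if "/" in token and not posixpath.isabs(token):
            raise ValueError("pathname is not absolute: %r" % token)
    return tokens


print(parse_jcarafe_args('--word-properties "/data/my lexicons/props.txt"'))
# ['--word-properties', '/data/my lexicons/props.txt']
```

Note that the quotes keep the pathname containing a space together as a single argument, and are removed from the result.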

A note about feature specifications

jCarafe uses a declarative representation for describing the features it will use when training and tagging. The default feature specification file is found in resources/default.fspec in your jCarafe directory. The jCarafe docs describe the meaning of the contents of these files, and how to write your own if you so choose.

In some cases, the feature specification interacts with trainer parameters. For instance, the contents of the lexicon files provided by the --lexicon_dir option are interpreted according to features in the feature specification file, in particular the lexFn and downLexFn specs, as follows:

If you want your lexicons to match tokens in the text case-sensitively (i.e., exact string match) you should build your lexicons with the casing you expect in the text and include the "lexFn" spec in your spec file.
If you want the lexicons to match case-insensitively, you should include an all lower-case lexicon file and then include the "downLexFn" spec instead of "lexFn".

So, for example, if you're annotating names, and all the names appear with consistent capitalization in your documents, you can use "lexFn" and list the elements in your lexicon file using the appropriate capitalization. On the other hand, if the capitalization is inconsistent, then you should use the case-insensitive option.

The default feature spec file distributed with MAT contains a "lexFn" feature, for case-sensitive matching.
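The difference between the two matching styles can be sketched as follows. This only illustrates the matching behavior described above; it is not how jCarafe implements the "lexFn" and "downLexFn" specs, and the function names and sample lexicons are assumptions made for this example.

```python
def lex_features(token, lexicons):
    """Case-sensitive lookup: names of the lexicons containing the exact token."""
    return sorted(name for name, entries in lexicons.items() if token in entries)


def down_lex_features(token, lexicons):
    """Case-insensitive lookup against lexicons stored in all lower case."""
    return lex_features(token.lower(), lexicons)


cased = {"NAME": {"Alice", "Nakamura"}}      # entries in the casing of the text
lowered = {"NAME": {"alice", "nakamura"}}    # all lower-case entries

print(lex_features("Alice", cased))          # ['NAME']
print(lex_features("ALICE", cased))          # []
print(down_lex_features("ALICE", lowered))   # ['NAME']
```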

The maximum-entropy tagging engine

The jCarafe maximum-entropy tagging engine wrapper is MAT.JavaCarafe.JCarafeMaxentClassifierTagStep. This class should be referenced in the <step_config> element of an <engine> in your task.xml file. Unlike the jCarafe CRF tagger, it is not sensitive to SEGMENTs.

The maximum-entropy tagging and training wrappers are intended to work with choice and boolean attributes of pre-determined spans, such as sentences. Each span will be used to create features for the training and tagging to learn the values for these attributes.

Note that because the tagger engine can be used in multiple steps, in any call to the processing engine the names of the tagger options and attributes must receive a prefix. This prefix is the name of the step (not the pretty name) which uses the maximum-entropy tagger as its engine. This step name is preprocessed by replacing all non-alphanumeric sequences by a single underscore. These prefixes must be used in the following contexts:

the command-line options passed to an invocation to MATEngine
the <args> and <prep_args> elements in the experiment XML file
the <settings> element of any workspace commands which invoke the engine
the attribute specified in an <iterator> element in the experiment XML file

These prefixes must not be used in the following contexts:

the <run_settings> element within the <step> element within workflows in task.xml

Options

Command line option	XML attribute	Value	Description
--<step_name>_model <model>	<step_name>_model	a string, a filename of a jCarafe model	If the task does not have a default model, the user must specify the location of the tagger model.
--heap_size <s>	heap_size	a heap size for the Java VM	The jCarafe maxent tagger is a Java application, and the default heap size may not be adequate for your model. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>	stack_size	a stack size for the Java VM	The jCarafe maxent tagger is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--<step_name>_jcarafe_args	<step_name>_jcarafe_args	a string	If present, a string containing arguments which will be passed directly to jCarafe. The string will be parsed using the Python shlex package, following POSIX command-line parsing rules as closely as possible. Note that all pathnames should be absolute.
--<step_name>_save_feature_vectors_for_scoring	<step_name>_save_feature_vectors_for_scoring	"yes" (XML)	If present, the feature vectors will be stashed in the MAT document for access during the scoring process (this access is not yet exploited).

The maximum-entropy training engine

The jCarafe maximum-entropy training engine class is MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder. You should reference this class in the <model_config> element of the <engine> in your task.xml file which references the jCarafe maximum-entropy tagging engine. Unlike the jCarafe CRF trainer, it is not sensitive to SEGMENTs.

Options

Command line option	XML attribute	Value	Description
--heap_size <s>	heap_size	a heap size for the Java VM	The jCarafe trainer is a Java application, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--stack_size <s>	stack_size	a stack size for the Java VM	The jCarafe trainer is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file.
--save_arff	save_arff	"yes" (XML)	If present, an ARFF file suitable for loading into Weka will be saved alongside the model.
--save_feature_vectors	save_feature_vectors	"yes" (XML)	If present, the feature vector file which is the input to the classifier will be saved alongside the model.
--gaussian_prior <f>	gaussian_prior	a float	A positive float, default is 10.0. See the jCarafe docs for details.
	feature_extractors	a comma-separated string of feature extractor functions	This attribute is only available at task configuration time, not on the command line. For more details, see below.
--jcarafe_args	jcarafe_args	a string	If present, a string containing arguments which will be passed directly to jCarafe. The string will be parsed using the Python shlex package, following POSIX command-line parsing rules as closely as possible. Note that all pathnames should be absolute.

Feature extractors

The jCarafe CRF training engine uses feature specification files to describe the features to use for training via a declarative language. This facility is not, unfortunately, available for the maximum-entropy training engine. This is both good and bad. On the one hand, it means that the task developer will have to explicitly specify the functions used to extract the features for each span to be classified, and write the desired functions if they don't already exist; on the other hand, it means that the feature extraction process is much more flexible. We've defined a few useful feature extractors for you to use; see src/MAT/lib/mat/python/MAT/AttributeClassifier.py for the implementation and how to define new ones.

Function name	Description
_bagOfWords	the set of unique tokens in the item
_caseNormalizedBagOfWords	the set of unique case-normalized tokens in the item
_weightedBagOfWords	the set of unique tokens in the item, weighted by count
_caseNormalizedWeightedBagOfWords	the set of unique case-normalized tokens in the item, weighted by count
_bigrams	the set of unique token bigrams in the item
_caseNormalizedBigrams	the set of unique case-normalized token bigrams in the item
_weightedBigrams	the set of unique token bigrams in the item, weighted by count
_caseNormalizedWeightedBigrams	the set of unique case-normalized token bigrams in the item, weighted by count
_trigrams	the set of unique token trigrams in the item
_caseNormalizedTrigrams	the set of unique case-normalized token trigrams in the item
_weightedTrigrams	the set of unique token trigrams in the item, weighted by count
_caseNormalizedWeightedTrigrams	the set of unique case-normalized token trigrams in the item, weighted by count
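To make the descriptions in the table concrete, here is a minimal sketch of what a few of these extractors compute over a list of tokens. These are not the functions in AttributeClassifier.py: the names, the signatures, the return types, and the reading of "case-normalized" as lower-casing are assumptions made for this example.

```python
from collections import Counter


def bag_of_words(tokens):
    """The set of unique tokens."""
    return set(tokens)


def case_normalized_bag_of_words(tokens):
    """The set of unique tokens after lower-casing (assumed normalization)."""
    return set(token.lower() for token in tokens)


def weighted_bag_of_words(tokens):
    """The unique tokens, each weighted by its count."""
    return Counter(tokens)


def bigrams(tokens):
    """The set of unique pairs of adjacent tokens."""
    return set(zip(tokens, tokens[1:]))


def weighted_bigrams(tokens):
    """The unique pairs of adjacent tokens, each weighted by its count."""
    return Counter(zip(tokens, tokens[1:]))


tokens = ["The", "cat", "saw", "the", "cat"]
print(sorted(bag_of_words(tokens)))                  # ['The', 'cat', 'saw', 'the']
print(sorted(case_normalized_bag_of_words(tokens)))  # ['cat', 'saw', 'the']
print(weighted_bag_of_words(tokens)["cat"])          # 2
print(len(bigrams(tokens)))                          # 4
```

Trigram extractors would follow the same pattern with zip(tokens, tokens[1:], tokens[2:]).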

These feature extractors must be declared when the engine is declared in the task, e.g.:

    <engine name='classifier_engine'>
      <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
        <build_settings feature_extractors="_bagOfWords,_bigrams"/>
      </model_config>
      <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
    </engine>