While it's possible to use your own training and tagging engine
in MAT, MAT provides jCarafe, a CRF-based sequence tagger and
maximum-entropy engine, as its default training and tagging engine.
The jCarafe sequence tagger provides training and tagging for
simple spans (i.e., span plus true or effective labels), while the
maximum-entropy engine provides training and tagging for attributes
with fixed values (e.g., choice attributes or boolean attributes).
jCarafe also provides an English tokenizer and a simple sentence
tagger.
The flags for configuring jCarafe are available in a number of
locations in MAT.
While the jCarafe engine only supports training and tagging for
simple spans and fixed-value attributes, it is not limited to
tasks which consist only of such annotations. The more complex
annotations (e.g., relations) can be added exclusively by hand
tagging, or by an engine other than jCarafe, which you can
provide, as noted above. However, our expectation is that, in
general, most installations will have only jCarafe available.
For instance, the jCarafe CRF engine can be used in a workflow step which adds an
annotation set whose annotation labels have arbitrary attributes
associated with them. However, the models that it builds will only
learn the simple span, and only the simple spans will be added
when the models are applied. The other attributes will have to be
added by hand, or by another workflow step (perhaps a jCarafe
maximum-entropy classifier step, if the attribute is fixed-value).
So your task may be very complex, but jCarafe may only
automatically process a subset of it. In that case, you wouldn't be
able to support a tag-a-little, learn-a-little loop, or the
experiment harness, for the full task.
Here's a complete list of places where jCarafe alone limits MAT
in the types of annotations it can deal with:
MAT is distributed with the jCarafe documentation, which is located here if you're viewing this documentation via a Web server, or in third_party/install/jcarafe.../resources if you've received MAT as a zip file and run the installation script.
The jCarafe tokenizer engine wrapper is
MAT.JavaCarafe.CarafeTokenizationStep. This class should be
referenced in the <step_config> element of an <engine>
in your task.xml file.
Command line option |
XML attribute |
Value |
Description |
---|---|---|---|
--heap_size <s> |
heap_size |
a heap size for the Java VM |
The jCarafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--stack_size <s> |
stack_size |
a stack size for the Java VM |
The jCarafe tokenizer is a Java application which tokenizes batches of documents at a time, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--handle_tags |
handle_tags |
"yes" (XML) |
If present, treat the signal
as XML and tokenize XML elements and entities as single
tokens. |
--tokenizer_patterns |
tokenizer_patterns |
a string |
See the jCarafe
docs on --tokenizer-patterns. |
--sentence_label |
sentence_label |
a string |
If present, preserve the sentence annotations
coming out of the jCarafe tokenizer and map them to the
specified label. |
--jcarafe_args |
jcarafe_args |
a string |
If present, a string containing arguments
which will be passed directly to jCarafe. The string will be
parsed using the Python shlex package, following POSIX
command-line parsing rules as closely as possible. Note that
all pathnames should be absolute. |
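As a rough illustration of how a --jcarafe_args string is split into individual arguments, the sketch below uses shlex.split from the Python shlex package in its default POSIX mode. The argument string is invented for the example: --some-flag and --some-path are placeholders, not real jCarafe options, and the exact way MAT calls shlex may differ.

```python
import shlex

# Hypothetical argument string; the flag names are placeholders,
# not real jCarafe options.
jcarafe_args = '--some-flag --some-path "/abs/path/with spaces/file.txt"'

# In POSIX mode (the default), quotes group words and are then removed.
parsed = shlex.split(jcarafe_args)
print(parsed)
```

Note that the quoted pathname survives as a single argument, which is why pathnames containing spaces must be quoted inside the string.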
The jCarafe conditional random field tagging engine wrapper is MAT.JavaCarafe.CarafeTagStep. This class should be referenced in the <step_config> element of an <engine> in your task.xml file.
The jCarafe tagger is sensitive to SEGMENTs.
It will only insert annotations into SEGMENTs for the current step
whose annotator = "MACHINE" or annotator = null. If no segments
are found, it will insert annotations into all the zones; if no
zones are found, it will insert annotations into the entire
document. Any SEGMENT into which annotations are inserted will be
marked annotator = "MACHINE".
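The fallback order described above (SEGMENTs, then zones, then the whole document) can be sketched roughly as follows. This is not MAT's implementation: the function name taggable_regions, the dictionary keys, and the (start, end) tuples are all hypothetical, and we assume that "no segments are found" means the document contains no SEGMENTs at all for the step.

```python
def taggable_regions(segments, zones, doc_length):
    """Return the (start, end) regions the tagger may annotate.

    segments: list of dicts with "start", "end" and "annotator" keys
              (a hypothetical stand-in for SEGMENT annotations).
    zones:    list of (start, end) tuples.
    """
    if segments:
        # Only segments owned by the machine, or by no one, are eligible.
        return [(s["start"], s["end"]) for s in segments
                if s.get("annotator") in (None, "MACHINE")]
    if zones:
        # No segments at all: fall back to every zone.
        return list(zones)
    # No segments and no zones: fall back to the entire document.
    return [(0, doc_length)]


segments = [
    {"start": 0, "end": 10, "annotator": None},
    {"start": 10, "end": 20, "annotator": "jane"},
    {"start": 20, "end": 30, "annotator": "MACHINE"},
]
print(taggable_regions(segments, [(0, 30)], 30))
```

In this sketch, the segment owned by a human annotator is skipped, so hand annotations are never overwritten.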
Note that because the jCarafe tagger engine can be used in
multiple steps, in any call to the processing engine the names of
the jCarafe tagger options and attributes must receive a prefix.
This prefix is the name of the step (not the pretty name)
which uses jCarafe as its engine. This step name is preprocessed
by replacing all non-alphanumeric sequences by a single
underscore. These prefixes must be used in the following contexts:
These prefixes must not be used in the following
contexts:
Command line option |
XML attribute |
Value |
Description |
---|---|---|---|
--<step_name>_local |
<step_name>_local |
"yes" (XML) |
By default, the MAT engine
will contact the MAT Web server to tag a document, because
the Web server has the capability of starting up and
monitoring a long-living tagger task. The reason this is
beneficial is that the jCarafe tagger, like many model-based
taggers, has a fairly expensive startup cost. To block the
engine from contacting the Web server, and force it to start
up and shut down the tagger on its own, specify
<step_name>_local="yes". |
--<step_name>_model
<model> |
<step_name>_model |
a string, a filename of a
jCarafe model |
If the task does not have a
default model, the user must specify the location of the
tagger model. |
--<step_name>_prior_adjust |
<step_name>_prior_adjust |
a float |
The jCarafe tagger can be biased toward recall or toward precision. This setting biases the jCarafe tagger to favor precision (positive values) or recall (negative values). Default is -1.0 (slight recall bias). The practical range of values is usually between -6.0 and +6.0. |
--heap_size <s> |
heap_size |
a heap size for the Java VM |
The jCarafe tagger is a Java
application, and the default heap size may not be adequate
for your model. The value here is passed to the Java VM
using the -Xmx argument. Values like 512M or 2G are examples
of expected values. This setting overrides any equivalent
setting in the <java_subprocess_parameters> in the
task.xml file. |
--stack_size <s> |
stack_size |
a stack size for the Java VM |
The jCarafe tagger is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--<step_name>_tagging_pre_models
<s> |
<step_name>_tagging_pre_models |
a string |
If present, a comma-separated
list of glob-style patterns specifying the models to include
as pre-taggers. This is an advanced feature that normal
users will not be using. |
--<step_name>_add_tokens_internally |
<step_name>_add_tokens_internally |
"yes" (XML) |
If present, jCarafe will use
its internal tokenizer to tokenize the document before
tagging. If your workflow doesn't tokenize the document, you
must provide this flag, or jCarafe will have no tokens to
base its tagging on. We recommend strongly that you tokenize your documents
separately; you should not use this flag. |
--<step_name>_capture_token_confidences |
<step_name>_capture_token_confidences |
"yes" (XML) |
If present, jCarafe will
capture token confidence metrics for later exploitation. |
--<step_name>_capture_sequence_confidences |
<step_name>_capture_sequence_confidences |
"yes" (XML) |
If present, jCarafe will
capture sequence confidence metrics for later exploitation. |
--<step_name>_parallel |
<step_name>_parallel |
"yes" (XML) |
If present, parallelizes the
decoding. |
--<step_name>_nthreads |
<step_name>_nthreads |
an integer |
If
--<step_name>_parallel is used, controls the number of
threads used for decoding. |
--<step_name>_jcarafe_args |
<step_name>_jcarafe_args |
a string |
If present, a string containing arguments
which will be passed directly to jCarafe. The string will be
parsed using the Python shlex package, following POSIX
command-line parsing rules as closely as possible. Note that
all pathnames should be absolute. |
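The step-name prefix rule described before this table can be sketched in a few lines of Python. The helper names option_prefix and prefixed_option, and the step name used in the example, are hypothetical; we also assume that "alphanumeric" means ASCII letters and digits, which the text does not state explicitly.

```python
import re


def option_prefix(step_name):
    # Replace every run of non-alphanumeric characters with one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", step_name)


def prefixed_option(step_name, option):
    # Build the command-line form, e.g. --<step_name>_model.
    return "--" + option_prefix(step_name) + "_" + option


print(prefixed_option("my tag.step", "model"))
```

The corresponding XML attribute would be the same string without the leading dashes.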
The jCarafe conditional random field training engine class is
MAT.JavaCarafe.CarafeModelBuilder. You should reference this class
in the <model_config> element of the <engine> in your
task.xml file which references the jCarafe tagging engine.
The jCarafe trainer is sensitive to SEGMENTs. It will train on all SEGMENTs for the relevant step which have been touched by a human annotator, whether or not they're gold (see, however, the --partial_training_on_gold_only option). If no SEGMENTs are found, it will use all the zones; if no zones are found, it will use the entire document. The scope of the training is important; any blank regions are treated as implicitly negative information (i.e., the trainer will conclude that there's no annotation there on purpose).
There are only a few settings here that you should change on any
regular basis:
Command line option |
XML attribute |
Value |
Description |
---|---|---|---|
--lexicon_dir <dir> |
lexicon_dir |
a pathname |
If present, the name of a
directory which contains a jCarafe training lexicon. This
pathname should be an absolute pathname, and should have a
trailing slash. The content of the directory should be a set
of files, each of which contains a sequence of tokens, one
per line. The name of the file will be used as a training
feature for the token. You can use this feature, for
instance, to provide implicit part-of-speech information
(e.g., create a file named ADJ which contains a sequence of
words that are adjectives) or name information (e.g., create
a file named NAME which contains a sequence of tokens which
can occur in proper names). On the command line, overrides any possible default in the <build_settings> for the relevant model config in the task.xml file for the task. Note that the interpretation of the contents of the lexicons depends on the jCarafe feature specification. |
--heap_size <s> |
heap_size |
a heap size for the Java VM |
The jCarafe trainer is a Java application, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--stack_size <s> |
stack_size |
a stack size for the Java VM |
The jCarafe trainer is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--parallel |
parallel |
"yes" (XML) |
If present, parallelizes the
training. |
--nthreads |
nthreads |
an integer |
If --parallel is used,
controls the number of threads used for training. |
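To make the --lexicon_dir layout concrete, here is a minimal sketch that writes a lexicon directory in the form described above: one file per feature name, one token per line. The helper name write_lexicon, the sample entries, and the UTF-8 encoding are assumptions made for the example, not something MAT or jCarafe prescribes.

```python
import os
import tempfile


def write_lexicon(lexicon_dir, entries):
    """Write one file per feature name, one token per line."""
    os.makedirs(lexicon_dir, exist_ok=True)
    for feature_name, tokens in entries.items():
        path = os.path.join(lexicon_dir, feature_name)
        with open(path, "w", encoding="utf-8") as f:
            for token in tokens:
                f.write(token + "\n")
    # Return an absolute pathname with a trailing separator.
    return os.path.join(os.path.abspath(lexicon_dir), "")


lexicon_dir = write_lexicon(
    os.path.join(tempfile.mkdtemp(), "lexicon"),
    {"ADJ": ["big", "small"], "NAME": ["Alice", "Smith"]},
)
print(lexicon_dir)
```

The returned string is in the form that the lexicon_dir description asks for (absolute, with a trailing slash) and could be passed as the value of --lexicon_dir.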
The options in this section are documented here for completeness. If you're not familiar with the jCarafe training engine and its implementation, the chances are that you'll never use any of these values. If you want to use them, someone who is knowledgeable about jCarafe should set these values for you in task.xml, and unless you really know what you're doing, you should not override them on the command line.
jCarafe provides the option of using non-standard training
methods. One of those methods is called periodic stepsize adjustment (PSA). This method,
when used correctly, is significantly faster than the normal
training mechanism. However, it sometimes performs less well, under
circumstances which are not yet well understood. You might prefer to use it if
you're doing comparative analysis of multiple models, or you're
just starting off with a rough-and-ready system and you don't need
to optimize on accuracy yet. The --max_iterations flag governs the
number of training cycles; more is not necessarily better, because
the engine may overfit to the data.
The documentation for the --feature_spec flag below refers to
jCarafe feature spec files. The documentation for how to create
these files can be found in the jCarafe documentation.
Similarly, if you want details on the --gaussian_prior,
--no_begin, --l1, and --l1_c flags, see the jCarafe documentation.
The documentation for --tags and --pre_models refers to an
advanced feature of jCarafe where it can use tagging models to
generate input features for multi-stage tagging. We will not
discuss this advanced capability of jCarafe any further.
Command line option |
XML attribute |
Value |
Description |
---|---|---|---|
--feature_spec <file> |
feature_spec |
a filename |
Name of the file that
contains the jCarafe feature
specification. The default specification will be used
if none is provided. If the filename is not an absolute
filename, it will be interpreted relative to the directory
of the task which is being trained for. (This is because
this option is more likely to be provided in your task.xml file
rather than on the command line.) On the command line, optional if feature_spec is set in the <build_settings> for the relevant model config in the task.xml file for the task. |
--training_method
<meth> |
training_method |
"psa" |
If present, specify a
training method other than the standard method. Currently,
the only recognized value is psa. The psa method is
noticeably faster, but may result in somewhat poorer
results. You can use a value of '' to override a previously
specified training method (e.g., a default method in your
task). |
--max_iterations <num> |
max_iterations |
an integer |
Number of iterations for the
training mechanism to use. Current defaults are 200 for
standard training, 10 for PSA training. On the command line,
overrides any possible default in the <build_settings>
for the relevant model config in the task.xml file for the
task. |
--tags <s> |
tags |
a string |
If present, a comma-separated
list of tags to pass to the training engine instead of the
full tag set for the task (used to create per-tag
pre-tagging models for multi-stage training and tagging). |
--pre_models <s> |
pre_models |
a string |
If present, a comma-separated
list of glob-style patterns specifying the models to include
as pre-taggers. |
--gaussian_prior <f> |
gaussian_prior |
a float |
A positive float, default is
10.0. See the jCarafe docs for details. |
--no_begin |
no_begin |
"yes" (XML) |
Don't introduce begin states
during training. Useful if you're certain that you won't
have any adjacent spans with the same label. See the jCarafe documentation
for more details. |
--l1 |
l1 |
"yes" (XML) |
Use L1 regularization for PSA
training. See the jCarafe
docs for details. |
--l1_c <f> |
l1_c |
a float |
Change the penalty factor for
the L1 regularizer. See the jCarafe docs for
details. |
--add_tokens_internally |
add_tokens_internally |
"yes" (XML) |
If present, jCarafe will use
its internal tokenizer to tokenize the document before
training. If your workflow doesn't tokenize the document,
you must provide this flag, or jCarafe will have no tokens
to base its training on. We recommend strongly that you
tokenize your documents separately; you should not use this
flag. |
--partial_training_on_gold_only |
partial_training_on_gold_only |
"yes" (XML) |
When jCarafe is presented
with partially
tagged documents, by default MAT will ask jCarafe to
train on all annotated segments, gold or not. If this flag
is specified, only "human gold" or "reconciled" segments
will be used for training. |
--word_properties |
word_properties |
a string |
See the jCarafe
docs on --word-properties. |
--word_scores |
word_scores |
a string |
See the jCarafe
docs on --word-scores. |
--learning_rate |
learning_rate |
a string |
See the jCarafe
docs on --learning-rate. |
--disk_cache |
disk_cache |
a string |
See the jCarafe
docs on --disk-cache. |
--jcarafe_args |
jcarafe_args |
a string |
If present, a string containing arguments
which will be passed directly to jCarafe. The string will be
parsed using the Python shlex package, following POSIX
command-line parsing rules as closely as possible. Note that
all pathnames should be absolute. |
jCarafe uses a declarative representation for describing the
features it will use when training and tagging. The default
feature specification file is found in resources/default.fspec in
your jCarafe directory. The jCarafe docs
describe the meaning of the contents of these files, and how to
write your own if you so choose.
In some cases, the feature specification interacts with trainer
parameters. For instance, the contents of the lexicon files
provided by the --lexicon_dir option are interpreted according to
features in the feature specification file, in particular the
lexFn and downLexFn specs, as follows:
So, for example, if you're annotating names, and all the names
appear with consistent capitalization in your documents, you can
use "lexFn" and list the elements in your lexicon file using the
appropriate capitalization. On the other hand, if the
capitalization is inconsistent, then you should use the
case-insensitive option.
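The difference between the two matching modes can be illustrated with a small sketch. It assumes that lexFn matches lexicon entries with exact case and that downLexFn is the case-insensitive counterpart, implemented here by lowercasing both sides; the second assumption is an inference, and the jCarafe documentation is the authority on the actual behavior. The function lexicon_features is hypothetical and is not part of jCarafe.

```python
def lexicon_features(token, lexicon, case_sensitive=True):
    """Return the names of the lexicon files that contain the token.

    lexicon maps a lexicon file name to the set of tokens in that file.
    """
    if case_sensitive:
        return sorted(name for name, entries in lexicon.items()
                      if token in entries)
    lowered = token.lower()
    return sorted(name for name, entries in lexicon.items()
                  if lowered in {entry.lower() for entry in entries})


lexicon = {"NAME": {"Alice", "Smith"}}
print(lexicon_features("smith", lexicon, case_sensitive=True))
print(lexicon_features("smith", lexicon, case_sensitive=False))
```

With exact-case matching, a lowercased "smith" in the document misses the "Smith" lexicon entry, which is the situation the case-insensitive option is meant to address.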
The default feature spec file distributed with MAT contains a
"lexFn" feature, for case-sensitive matching.
The jCarafe maximum-entropy tagging engine wrapper is
MAT.JavaCarafe.JCarafeMaxentClassifierTagStep. This class should
be referenced in the <step_config> element of an
<engine> in your task.xml file. Unlike the jCarafe tagger,
it is not sensitive to SEGMENTs.
The maximum-entropy tagging and training wrappers are intended to
work with choice and boolean attributes of pre-determined spans,
such as sentences. Each span will be used to create features
for the training and tagging to learn the values for these
attributes.
Note that because the tagger engine can be used in multiple
steps, in any call to the processing engine the names of the
tagger options and attributes must receive a prefix. This prefix
is the name of the step (not the pretty name) which uses
the maximum-entropy tagger as its engine. This step name is
preprocessed by replacing all non-alphanumeric sequences by a
single underscore. These prefixes must be used in the following
contexts:
These prefixes must not be used in the following
contexts:
Command line option |
XML attribute |
Value |
Description |
---|---|---|---|
--<step_name>_model
<model> |
<step_name>_model |
a string, a filename of a
jCarafe model |
If the task does not have a
default model, the user must specify the location of the
tagger model. |
--heap_size <s> |
heap_size |
a heap size for the Java VM |
The jCarafe maxent tagger is
a Java application, and the default heap size may not be
adequate for your model. The value here is passed to the
Java VM using the -Xmx argument. Values like 512M or 2G are
examples of expected values. This setting overrides any
equivalent setting in the <java_subprocess_parameters>
in the task.xml file. |
--stack_size <s> |
stack_size |
a stack size for the Java VM |
The jCarafe maxent tagger is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--<step_name>_jcarafe_args |
<step_name>_jcarafe_args |
a string |
If present, a string containing arguments
which will be passed directly to jCarafe. The string will be
parsed using the Python shlex package, following POSIX
command-line parsing rules as closely as possible. Note that
all pathnames should be absolute. |
--<step_name>_save_feature_vectors_for_scoring |
<step_name>_save_feature_vectors_for_scoring |
"yes" (XML) |
If present, the feature vectors will be
stashed in the MAT document for access during the scoring
process (access not yet exploited). |
The jCarafe maximum-entropy training engine class is
MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder. You should
reference this class in the <model_config> element of the
<engine> in your task.xml file which references the jCarafe
maximum-entropy tagging engine. Unlike the jCarafe CRF trainer, it
is not sensitive to SEGMENTs.
Command line option |
XML attribute |
Value |
Description |
---|---|---|---|
--heap_size <s> |
heap_size |
a heap size for the Java VM |
The jCarafe trainer is a Java application, and the default heap size may not be adequate for your data. The value here is passed to the Java VM using the -Xmx argument. Values like 512M or 2G are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--stack_size <s> |
stack_size |
a stack size for the Java VM |
The jCarafe trainer is a Java application, and the default stack size may not be adequate for your data. The value here is passed to the Java VM using the -Xss argument. Values like 4096k or 512k are examples of expected values. This setting overrides any equivalent setting in the <java_subprocess_parameters> in the task.xml file. |
--save_arff |
save_arff |
"yes" (XML) |
If present, an ARFF file
suitable for loading into Weka will be saved alongside the
model. |
--save_feature_vectors |
save_feature_vectors |
"yes" (XML) |
If present, the feature
vector file which is the input to the classifier will be
saved alongside the model. |
--gaussian_prior <f> |
gaussian_prior |
a float |
A positive float, default is
10.0. See the jCarafe
docs for details. |
(none) |
feature_extractors |
a comma-separated string of feature extractor
functions |
This attribute is only available at task
configuration time, not on the command line. For more
details, see below. |
--jcarafe_args |
jcarafe_args |
a string |
If present, a string containing arguments
which will be passed directly to jCarafe. The string will be
parsed using the Python shlex package, following POSIX
command-line parsing rules as closely as possible. Note that
all pathnames should be absolute. |
The jCarafe CRF training engine uses feature
specification files to describe the features to use for
training via a declarative language. This facility is not,
unfortunately, available for the maximum-entropy training engine.
This is both good and bad. On the one hand, it means that the task
developer will have to explicitly specify the functions used to
extract the features for each span to be classified, and write the
desired functions if they don't already exist; on the other hand,
it means that the feature extraction process is much more
flexible. We've defined a few useful feature extractors for you to
use; see src/MAT/lib/mat/python/MAT/AttributeClassifier.py for the
implementation and how to define new ones.
Function name |
Description |
---|---|
_bagOfWords |
the set of unique tokens in the item |
_caseNormalizedBagOfWords |
the set of unique case-normalized tokens in
the item |
_weightedBagOfWords |
the set of unique tokens in the item,
weighted by count |
_caseNormalizedWeightedBagOfWords |
the set of unique case-normalized tokens in
the item, weighted by count |
_bigrams |
the set of unique token bigrams in the item |
_caseNormalizedBigrams |
the set of unique case-normalized token bigrams in the item |
_weightedBigrams |
the set of unique token bigrams in the item, weighted by count |
_caseNormalizedWeightedBigrams |
the set of unique case-normalized token bigrams in the item, weighted by count |
_trigrams |
the set of unique token trigrams in the item |
_caseNormalizedTrigrams |
the set of unique case-normalized token trigrams in the item |
_weightedTrigrams |
the set of unique token trigrams in the item, weighted by count |
_caseNormalizedWeightedTrigrams |
the set of unique case-normalized token trigrams in the item, weighted by count |
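To give a feel for what these descriptions mean, the sketch below computes a few of the feature sets for a short token list. These are not the MAT extractors themselves (see AttributeClassifier.py for those): the function names are hypothetical, and we assume that case normalization means lowercasing and that "weighted by count" means each unique item is paired with its number of occurrences.

```python
from collections import Counter


def bag_of_words(tokens):
    # The set of unique tokens.
    return set(tokens)


def weighted_bag_of_words(tokens):
    # Unique tokens, each paired with its occurrence count.
    return Counter(tokens)


def ngrams(tokens, n):
    # The set of unique runs of n consecutive tokens.
    return set(zip(*(tokens[i:] for i in range(n))))


def weighted_ngrams(tokens, n):
    # Unique n-grams, each paired with its occurrence count.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def case_normalize(tokens):
    # Assumption: case normalization is plain lowercasing.
    return [t.lower() for t in tokens]


tokens = ["The", "cat", "saw", "the", "cat"]
print(bag_of_words(tokens))
print(weighted_bag_of_words(case_normalize(tokens)))
print(ngrams(case_normalize(tokens), 2))
print(weighted_ngrams(case_normalize(tokens), 2))
```

In this example, case normalization merges "The" and "the", so the case-normalized variants produce fewer distinct features than the plain ones.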
These feature extractors must be declared when the engine is
declared in the task, e.g.:
<engine name='classifier_engine'>
  <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
    <build_settings feature_extractors="_bagOfWords,_bigrams"/>
  </model_config>
  <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
</engine>
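If you want to check such a declaration before handing it to MAT, a quick sanity check along the following lines may help. It uses only the Python standard library to parse the fragment above and split the feature_extractors attribute on commas; it is a standalone sketch and does not reflect how MAT itself reads task.xml.

```python
import xml.etree.ElementTree as ET

ENGINE_XML = """<engine name='classifier_engine'>
  <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
    <build_settings feature_extractors="_bagOfWords,_bigrams"/>
  </model_config>
  <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
</engine>"""

engine = ET.fromstring(ENGINE_XML)
settings = engine.find("model_config/build_settings")

# Split the comma-separated attribute into individual extractor names.
extractors = [name.strip()
              for name in settings.get("feature_extractors", "").split(",")
              if name.strip()]
print(extractors)
```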