Now that you've completed Tutorial 1, let's move on to how you might use your tagged documents to build a model. We're going to use the same simple named entity task that comes with MAT, and we're going to assume that your task is installed (see step 1 in Tutorial 1 if it isn't). Where Tutorial 1 involved the UI, this tutorial (and the next one) involves one of the command-line tools. Like Tutorial 1, we're going to do this tutorial in file mode. And because this tutorial involves the command line, make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
As we saw in Tutorial 1, the sample task contains twenty raw
ASCII files in the directory
MAT_PKG_HOME/sample/ne/resources/data/raw. The sample task also
contains annotated versions of these files, in
MAT_PKG_HOME/sample/ne/resources/data/json. Rather than ask you to
hand-annotate all twenty of these documents, we'll use the
already-annotated versions to build a model.
The tool we're going to use here is MATModelBuilder.
In a shell:
Unix:
$ cd $MAT_PKG_HOME
$ bin/MATModelBuilder --task 'Named Entity' --model_file /tmp/ne_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATModelBuilder.cmd --task "Named Entity" --model_file %TMP%\ne_model ^
--input_files "%CD%\sample\ne\resources\data\json\*.json"
Each call to the model builder requires a task, just as the UI
required in Tutorial 1. The --model_file directive tells the tool
where to save the model, and the --input_files directive tells the
tool which files to use. There are many other arguments available
to this tool; see the tool
documentation for more details.
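Because the --input_files pattern is quoted, your shell does not expand it before passing it to the tool, so a mistyped path may go unnoticed until the tool runs. The following is a minimal sketch for Unix shells, not part of MAT, that counts how many files a quoted pattern matches. The function name count_matches is hypothetical.

```shell
# Hypothetical helper (not part of MAT): count the files a glob pattern matches.
# Pass the pattern in quotes so it reaches the function unexpanded.
# Limitation: paths containing spaces are not handled, because the
# pattern is deliberately left unquoted inside the loop.
count_matches() {
    n=0
    # $1 is unquoted on purpose so the shell expands the wildcard here.
    for f in $1; do
        # An unmatched pattern stays a literal string, so test for a real file.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example, count_matches "$MAT_PKG_HOME/sample/ne/resources/data/json/*.json" prints the number of annotated sample documents found. A result of 0 suggests that the path or the pattern is wrong.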
When you run this, you should see something like the following
output:
Processed 259 sequences . . . beginning parameter estimation..
Number of features = 63787
Number of states = 7
About to train model...
Stochastic Gradient Descent Training (with PSA) over 259 instances
maxEpochs= 6; batchSize= 1; max_iters= 1554
The eta's are initialized to 0.1
p_alpha= 0.9999; p_beta= 0.99; n= 10; k= 0.95; big_m= 190.94999999999962; small_m= 0.01919191919191704
............Epoch 1 complete (of 6)
Log-likelihood for Epoch: 1878.0822126846201
.............Epoch 2 complete (of 6)
Log-likelihood for Epoch: 833.0760337758322
.............Epoch 3 complete (of 6)
Log-likelihood for Epoch: 676.2601786389712
.............Epoch 4 complete (of 6)
Log-likelihood for Epoch: 625.4939296057713
.............Epoch 5 complete (of 6)
Log-likelihood for Epoch: 604.3845314942303
.............Epoch 6 complete (of 6)
Log-likelihood for Epoch: 592.6296305922921
...Training completed in 1.688347 seconds
..There are 63787 total features: 0 have a zero weight and 63787 have a non-zero weight.
The default behavior of the model builder is specified in the task.xml file associated with this
task.
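In the sample output above, the value reported after each epoch falls steadily, from about 1878 to about 593. If you save the tool's output to a file, a short script can check this for you. The following is a minimal sketch, not part of MAT. It assumes that a steadily falling value indicates a normal run, which the sample output suggests but this tutorial does not state. The function name check_epochs and the file name train.log used below are hypothetical.

```shell
# Hypothetical helper (not part of MAT): read saved model builder output on
# stdin and report whether the "Log-likelihood for Epoch:" values strictly
# decrease from one epoch to the next.
check_epochs() {
    awk '
        /Log-likelihood for Epoch:/ {
            value = $NF + 0                 # the number is the last field
            if (seen > 0 && value >= prev) { bad = 1 }
            prev = value
            seen++
        }
        END {
            if (seen == 0) { print "no epochs found"; exit 2 }
            if (bad)       { print "not decreasing";  exit 1 }
            print "decreasing over " seen " epochs"
        }
    '
}
```

With the output saved in train.log, run check_epochs < train.log. The function exits with status 0 when the values decrease, 1 when they do not, and 2 when no epoch lines are found.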
We've successfully built a model, but we're not going to use it
quite yet.
Our task has also been configured, in the task.xml file, to
recognize the location of a default model. This location is
usually a relative pathname that points to the directory
containing the task.xml file or to one of its descendants, and
the MAT tools check it by default when they look for a model in
file mode. When you call MATModelBuilder, you have the option of
saving the model as the default model. Let's do that, so we can
make use of the default model in the next tutorial.
In a shell:
Unix:
$ cd $MAT_PKG_HOME
$ bin/MATModelBuilder --task 'Named Entity' --save_as_default_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATModelBuilder.cmd --task "Named Entity" --save_as_default_model ^
--input_files "%CD%\sample\ne\resources\data\json\*.json"
The output you see should be similar to the output of the first MATModelBuilder run above.
If you're not planning on doing any other tutorials, and you
don't want the "Named Entity" task hanging around, remove it as
shown in the final step of Tutorial 1.
This concludes Tutorial 2.