Tutorial 4: Use the command-line engine and scorer

In Tutorial 2, you used a command-line tool to build a model. In this tutorial, we'll use the command-line engine to process documents using, among other things, the model you built. We'll use the same simple named entity task that comes with MAT, and we'll assume that your task is installed. As in Tutorial 2, we'll work in file mode. Because this tutorial involves the command line, make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.

We'll make use of the models you built in Tutorial 2, and we'll also use MATEngine.

Step 1: Review important command-line arguments

First, let's review some of the arguments to MATEngine.

In a shell:

Unix:

% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity'

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity"

The --task directive is the first, and most important, directive. Unless you only have one task installed, you'll always need it. But the engine needs more than that, so it reports an error and prints its help message. This is some of what you'll see (we've edited the help message for this example; see the MATEngine page for examples and complete documentation).

Error: workflow must be specified
Usage: MATEngine [options] ...

Named Entity :
  available workflows:
    Hand annotation : steps zone, tokenize, tag
    Review/repair : steps 
    Demo : steps zone, tokenize, tag

  Input options:
    --input_file=file   The file to process. Either this or --input_dir must
                        be specified. A single dash ('-') will cause the engine to read
                        from standard input.
    --input_dir=dir     The directory to process. Either this or --input_file
                        must be specified.
    --input_encoding=encoding
                        Input character encoding for raw files. Default is
                        utf-8.
    --input_file_type=raw | mat-json
                        The file type of the input. Either raw (a raw file) or
                        mat-json (a rich JSON file produced as the output of
                        this engine or the annotation tool). Required.

  Output options:
    --output_file=file  Where to save the output. Optional. Must be paired
                        with --input_file. A single dash ('-') will cause the engine to
                        write to standard output.
    --output_dir=dir    Where to save the output. Optional. Must be paired
                        with --input_dir.
    --output_fsuff=suffix
                        The suffix to add to each filename when --output_dir
                        is specified. If absent, the name of each file will be
                        identical to the name of the file in the input
                        directory.
    --output_file_type=raw | mat-json
                        The type of the file to save. Either raw (a raw file)
                        or mat-json (a rich JSON file). Required if either
                        --output_file or --output_dir is specified.
    --output_encoding=encoding
                        Output character encoding for raw files. Default is
                        utf-8.

  Task options:
    --workflow=workflow
                        The name of a workflow, as specified in some task.xml
                        file. Required if more than one workflow is available.
                        See above for available workflows.
    --steps=step,step,...
                        Some ordered subset of the steps in the specified
                        workflow. The steps should be concatenated with a
                        comma. See above for available steps.
    --undo_through=step
                        A step in the current workflow. All possible steps
                        already done in the document which follow this step
                        are undone, including this step, before any of the
                        steps in --steps are applied. You can use this flag in
                        conjunction with --steps to rewind and then reapply
                        operations.

The input and output options should be self-explanatory. Raw files are read and written in a character encoding, which defaults to UTF-8 if you don't specify one. The input always requires a file type, and the output requires one whenever you specify an output file or directory; each type names one of the MAT readers or writers. For the purposes of this tutorial, the only ones you need to know about are "raw" and "mat-json" (as shown above).
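
The pairing rules in the help text can be sketched as a small Python helper. This is only an illustration under stated assumptions, not part of MAT: the function name build_io_args is hypothetical, we read "either ... or" as "exactly one", and we restrict file types to the two used in this tutorial.

```python
from typing import List, Optional

# The only two file types this tutorial uses; MAT may offer others.
TUTORIAL_FILE_TYPES = ("raw", "mat-json")


def build_io_args(
    input_file: Optional[str] = None,
    input_dir: Optional[str] = None,
    input_file_type: str = "raw",
    output_file: Optional[str] = None,
    output_dir: Optional[str] = None,
    output_file_type: Optional[str] = None,
) -> List[str]:
    """Return MATEngine input/output arguments, or raise ValueError."""
    # Assumption: "either ... or" in the help text means exactly one.
    if (input_file is None) == (input_dir is None):
        raise ValueError("give exactly one of input_file or input_dir")
    if input_file_type not in TUTORIAL_FILE_TYPES:
        raise ValueError("input_file_type must be raw or mat-json")
    # --output_file pairs with --input_file, --output_dir with --input_dir.
    if output_file is not None and input_file is None:
        raise ValueError("output_file must be paired with input_file")
    if output_dir is not None and input_dir is None:
        raise ValueError("output_dir must be paired with input_dir")
    # The output file type is required as soon as any output is requested.
    wants_output = output_file is not None or output_dir is not None
    if wants_output and output_file_type not in TUTORIAL_FILE_TYPES:
        raise ValueError("output_file_type is required with an output")

    args: List[str] = []
    if input_file is not None:
        args += ["--input_file", input_file]
    else:
        args += ["--input_dir", input_dir]
    args += ["--input_file_type", input_file_type]
    if output_file is not None:
        args += ["--output_file", output_file]
    if output_dir is not None:
        args += ["--output_dir", output_dir]
    if wants_output:
        args += ["--output_file_type", output_file_type]
    return args
```

Calling build_io_args(input_file="voa2.txt", output_file="voa2_txt.json", output_file_type="mat-json") returns the argument list, while pairing --output_file with --input_dir raises an error before the engine is ever started.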

At the top of the help message, you'll see a listing for the "Named Entity" task, showing you the named workflows and the steps in each workflow. The step is the basic unit, and steps are ordered in workflows. In order to do anything with the MATEngine, you need to specify a workflow and some set of steps. For now, that's all you need to know; the documentation on tasks, workflows and steps provides more detail, as does the documentation on the sample 'Named Entity' task.
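
To make "some ordered subset of the steps" concrete, here is a minimal Python sketch that checks a candidate --steps value against the workflows listed in the help message. The names WORKFLOWS, is_ordered_subset and steps_argument are hypothetical, and the check is our reading of the help text rather than MAT's own validation.

```python
from typing import Dict, List, Sequence

# Workflows and steps copied from the help message above.
WORKFLOWS: Dict[str, List[str]] = {
    "Hand annotation": ["zone", "tokenize", "tag"],
    "Review/repair": [],
    "Demo": ["zone", "tokenize", "tag"],
}


def is_ordered_subset(steps: Sequence[str], workflow: str) -> bool:
    """True if `steps` appear in `workflow` in the same relative order."""
    remaining = iter(WORKFLOWS[workflow])
    # `step in remaining` consumes the iterator up to the match, so each
    # later step must be found after the previous one.
    return all(step in remaining for step in steps)


def steps_argument(steps: Sequence[str], workflow: str) -> str:
    """Return the comma-joined value for --steps, or raise ValueError."""
    if not is_ordered_subset(steps, workflow):
        raise ValueError(f"{list(steps)} is not an ordered subset of {workflow!r}")
    return ",".join(steps)
```

With these definitions, ["zone", "tokenize"] is accepted for the Demo workflow, while ["tokenize", "zone"] is rejected because the order is reversed.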

Step 2: Prepare a document for tagging

Back in Tutorial 1, we used the UI to prepare a document for hand tagging, because it was less complex than using the command-line engine. Now, we'll show you how to prepare a document for tagging (either hand tagging or automated tagging) on the command line.

In order to prepare a document for tagging, you can use the "Demo" workflow in the Named Entity task (the meanings of the workflows and steps may be different in other tasks). In this task, "zone" marks the appropriate taggable regions in the document, and "tokenize" identifies the word units in the document (because the annotation and training engine uses words as its basic elements). Let's prepare our raw document voa2.txt:

Unix:

% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'zone,tokenize' \
--input_file $PWD/sample/ne/resources/data/raw/voa2.txt --input_file_type raw \
--output_file ./voa2_txt.json --output_file_type mat-json
zone : voa2.txt
tokenize : voa2.txt

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize" ^
--input_file %CD%\sample\ne\resources\data\raw\voa2.txt --input_file_type raw ^
--output_file %CD%\voa2_txt.json --output_file_type mat-json
zone : voa2.txt
tokenize : voa2.txt

Here, we applied the zone and tokenize steps of the Demo workflow in the "Named Entity" task to the raw input file voa2.txt, and saved the result as a rich annotated document, voa2_txt.json. Notice that the command reports each step as it applies it.

Note that we can do multiple steps in the same invocation of MATEngine; the only reason we're preparing the document separately from tagging it is for illustration.
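
As a rough sketch of what a single invocation covering all three steps might look like, the Python helper below assembles one argument list for either platform. The function names are hypothetical; the options themselves are taken from this tutorial, including --carafe_tag_local, which is described in Step 3. We assume here that the tag step can use a default model, as Step 3 does.

```python
import ntpath
import posixpath
from typing import List


def engine_executable(mat_pkg_home: str, windows: bool) -> str:
    """Path to the engine: bin/MATEngine on Unix, bin\\MATEngine.cmd on Windows."""
    if windows:
        return ntpath.join(mat_pkg_home, "bin", "MATEngine.cmd")
    return posixpath.join(mat_pkg_home, "bin", "MATEngine")


def one_shot_command(
    mat_pkg_home: str, raw_file: str, out_file: str, windows: bool = False
) -> List[str]:
    """Zone, tokenize and tag a raw file in one MATEngine invocation."""
    return [
        engine_executable(mat_pkg_home, windows),
        "--task", "Named Entity",
        "--workflow", "Demo",
        "--steps", "zone,tokenize,tag",
        "--input_file", raw_file, "--input_file_type", "raw",
        "--output_file", out_file, "--output_file_type", "mat-json",
        # Tag with a local command instead of contacting a server (see Step 3).
        "--carafe_tag_local",
    ]
```

The resulting list can be passed to subprocess.run from the Python standard library; passing the arguments as a list also avoids the quoting differences between the Unix and Windows shells.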

If you want to review this document, the easiest way is to load it into the UI; it should be identical to the output of step 2 in Tutorial 3. You can also examine it in your favorite editor, but it'll be fairly difficult to read, even if you're familiar with the MAT JSON annotated file format.
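
If you'd rather get a quick overview than read the file in an editor, a short Python sketch like the following can summarize the top level of the document. It assumes only that the file contains a single JSON object; it makes no assumptions about which keys the MAT JSON format defines, and the function name summarize_json is hypothetical.

```python
import json
from typing import Dict, Optional, Tuple


def summarize_json(text: str) -> Dict[str, Tuple[str, Optional[int]]]:
    """Map each top-level key to (Python type name, length if it has one)."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    summary: Dict[str, Tuple[str, Optional[int]]] = {}
    for key, value in doc.items():
        if isinstance(value, (str, list, dict)):
            summary[key] = (type(value).__name__, len(value))
        else:
            summary[key] = (type(value).__name__, None)
    return summary
```

To use it on the file from this step, read voa2_txt.json with open(..., encoding="utf-8") and pass the contents to summarize_json.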

Step 3: Tag the document

In the same workflow, we'll now perform the "tag" step on the file we just created.

First, let's see what happens when we try to zone and tokenize the document again:

Unix:

% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'zone,tokenize' \
--input_file ./voa2_txt.json --input_file_type mat-json --output_file ./voa2_txt.json \
--output_file_type mat-json

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize" ^
--input_file %CD%\voa2_txt.json --input_file_type mat-json --output_file %CD%\voa2_txt.json ^
--output_file_type mat-json

You'll notice that the engine reports nothing. The input annotated document records which annotation sets have been added, and the engine won't repeat the steps that add them.

Next, let's review some more of the command-line options (again, we've edited down the options for the purposes of this discussion):

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task 'Named Entity' --workflow Demo

Windows native:

> %MAT_PKG_HOME%\bin\MATEngine.cmd --task "Named Entity" --workflow Demo

Usage: MATEngine [options] ...


  Options for step 'tokenize' (workflows Hand annotation, Align, Demo):
    --heap_size=HEAP_SIZE
                        If present, specifies the -Xmx argument for the Java JVM

  Options for step 'tag' (workflows Demo):
    See also --heap_size in Options for step 'tokenize' (workflows Hand annotation, Align, Demo)

    --carafe_tag_tagging_pre_models=TAGGING_PRE_MODELS
                        if present, a comma-separated list of glob-style patterns specifying the models to include as pre-
                        taggers.
    --carafe_tag_local  don't try to contact a remote tagger server; rather, start up a local command.
    --carafe_tag_model=TAGGER_MODEL
                        provide a tagger model file. Obligatory if no model is specified in the task step.
    --carafe_tag_prior_adjust=PRIOR_ADJUST
                        Bias the jCarafe tagger to favor precision (positive values) or recall (negative values). Default is
                        -1.0 (slight recall bias). Practical range of values is usually +-6.0.

We can control the "tag" step with the command-line options shown here. Right now, the option we're interested in is --carafe_tag_local, because we don't want the engine to try to contact the Web server to tag the document. We won't pass --carafe_tag_model in this step, because we're going to take advantage of the fact that we built a default model in Tutorial 2.
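
The options for the "tag" step can be collected with a small helper like the Python sketch below. The function names are hypothetical, the option names come from the help message above, and treating a value of exactly zero as "neutral" is our assumption; the help text only describes positive and negative values.

```python
from typing import List, Optional


def tag_step_args(
    local: bool = True,
    model: Optional[str] = None,
    prior_adjust: Optional[float] = None,
) -> List[str]:
    """Return the command-line options for the 'tag' step."""
    args: List[str] = []
    if local:
        # Start a local tagger rather than contacting a server.
        args.append("--carafe_tag_local")
    if model is not None:
        args += ["--carafe_tag_model", model]
    if prior_adjust is not None:
        # The --option=value form keeps a negative number attached to its option.
        args.append(f"--carafe_tag_prior_adjust={prior_adjust}")
    return args


def describe_bias(prior_adjust: float) -> str:
    """Positive values favor precision, negative values favor recall."""
    if prior_adjust > 0:
        return "precision"
    if prior_adjust < 0:
        return "recall"
    return "neutral"  # assumption: zero means no bias either way
```

For example, describe_bias(-1.0) returns "recall", which matches the help message's description of the default as a slight recall bias.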

Unix:

% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'tag' --input_file ./voa2_txt.json \
--input_file_type mat-json --output_file ./voa2_txt.json --output_file_type mat-json \
--carafe_tag_local
tag : voa2_txt.json

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag" --input_file %CD%\voa2_txt.json ^
--input_file_type mat-json --output_file %CD%\voa2_txt.json --output_file_type mat-json ^
--carafe_tag_local
tag : voa2_txt.json

Notice that the engine reports the tag step as it applies it.

What happens if you try to repeat this command? MAT "knows" that the annotations added by the previous application of the "tag" step were added by an automated engine and have not yet been hand-corrected. It also knows that the engine which applies the "tag" step is retaggable; you might want to retag if you changed your model, for instance. Given those two pieces of information, MAT removes the previously added annotations and repeats the step. (This didn't happen with the zone and tokenize steps because there's no reason for either engine to be retaggable, so they aren't.)
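
The decision described in this paragraph can be pictured with a toy Python model. This is not MAT's implementation: the function plan_steps and its arguments are hypothetical, and skipping a retaggable step once it has been hand-corrected is our assumption, since the paragraph only covers the uncorrected case.

```python
from typing import Iterable, List, Set, Tuple


def plan_steps(
    requested: Iterable[str],
    done: Set[str],
    retaggable: Set[str],
    hand_corrected: Set[str],
) -> List[Tuple[str, str]]:
    """Decide, for each requested step, whether to run, redo or skip it."""
    actions: List[Tuple[str, str]] = []
    for step in requested:
        if step not in done:
            actions.append(("run", step))
        elif step in retaggable and step not in hand_corrected:
            # Automated, uncorrected output: remove it and repeat the step.
            actions.append(("redo", step))
        else:
            # Already done and not repeatable (or, by assumption, corrected).
            actions.append(("skip", step))
    return actions
```

In this model, requesting zone, tokenize and tag on the document we just tagged skips the first two steps and redoes only "tag".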

If you load this document into the UI, you'll see that it looks identical to the output of step 3 in Tutorial 3.

Step 4: Undo and redo

If the workflow is undoable (not all of them are; more about this later), you can undo steps and redo them in the same command. Suppose that, as in step 4 in Tutorial 3, you've applied the model but haven't done any hand correction. You've since prepared a more accurate default model, and you want to undo the automated tagging and apply the newer model.

You can use the --undo_through directive to achieve this. In addition, we're going to use the other model you built, in step 1 of Tutorial 2. The --undo_through directive can be used in conjunction with --steps; the undo will apply first.

Unix:

% cd $MAT_PKG_HOME
% bin/MATEngine --task 'Named Entity' --workflow Demo --steps 'tag' --input_file ./voa2_txt.json \
--input_file_type mat-json --output_file ./voa2_txt.json --output_file_type mat-json \
--carafe_tag_local --carafe_tag_model /tmp/ne_model --undo_through tag
tag : voa2_txt.json

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag" --input_file %CD%\voa2_txt.json ^
--input_file_type mat-json --output_file %CD%\voa2_txt.json --output_file_type mat-json ^
--carafe_tag_local --carafe_tag_model %TMP%\ne_model --undo_through tag
tag : voa2_txt.json

The --carafe_tag_model directive allows us to specify an explicit model to use, and the --undo_through directive undoes the listed step along with any steps already done after it. You'll notice that if you omit --undo_through, nothing will happen (because the document is already tagged), but with --undo_through, the document is tagged again (because --undo_through happens before --steps).
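
The order of operations can be illustrated with a toy Python model of --undo_through followed by --steps. It follows the wording of the help message (the named step and every done step after it are undone first) but is only a sketch; the function names are hypothetical and nothing here reflects MAT's internals.

```python
from typing import List, Sequence

DEMO_STEPS = ["zone", "tokenize", "tag"]  # the Demo workflow from the help message


def undo_through(workflow: Sequence[str], done: Sequence[str], step: str) -> List[str]:
    """Return the steps still done after undoing `step` and everything after it."""
    undone = set(workflow[workflow.index(step):])
    return [s for s in done if s not in undone]


def undo_then_apply(
    workflow: Sequence[str], done: Sequence[str], undo_step: str, steps: Sequence[str]
) -> List[str]:
    """Undo first, then apply any requested steps that are no longer done."""
    remaining = undo_through(workflow, done, undo_step)
    for step in steps:
        if step not in remaining:
            remaining.append(step)
    return remaining
```

Undoing through "tokenize" on a fully processed document leaves only "zone" done, while undoing through "tag" and then applying "tag" ends with all three steps done again, this time with the new model's annotations.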

Step 5: Run the scorer

Recall that we have a version of this file which has already been tagged: sample/ne/resources/data/json/voa2.txt.json. We can treat that version as the reference file and the version we just tagged as the hypothesis file, and run the scoring tool:

Unix:

% cd $MAT_PKG_HOME
% bin/MATScore --task "Named Entity" --file ./voa2_txt.json \
--ref_file ./sample/ne/resources/data/json/voa2.txt.json

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATScore.cmd --task "Named Entity" --file %CD%\voa2_txt.json ^
--ref_file %CD%\sample\ne\resources\data\json\voa2.txt.json

The scorer will print a table to standard output describing the precision, recall, and F-measure at the tag level for this file comparison. The scorer has a large range of options; see the documentation for MATScore for details and examples.
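
As a reminder of how the three measures relate, here is a minimal Python sketch of the standard definitions, using the balanced F-measure. How MATScore decides that a hypothesis tag matches a reference tag is not covered here, and the counts in the usage note are invented for illustration.

```python
from typing import Tuple


def precision_recall_f(match: int, ref_total: int, hyp_total: int) -> Tuple[float, float, float]:
    """Standard precision, recall and balanced F-measure from raw counts.

    match:     hypothesis tags that agree with a reference tag
    ref_total: number of tags in the reference file
    hyp_total: number of tags in the hypothesis file
    """
    precision = match / hyp_total if hyp_total else 0.0
    recall = match / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

With made-up counts of 8 matching tags, 10 reference tags and 16 hypothesis tags, precision is 0.5, recall is 0.8, and the F-measure is about 0.615.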

Step 6: Clean up (optional)

If you're not planning on doing any other tutorials, and you don't want the "Named Entity" task hanging around, remove it as shown in the final step of Tutorial 1.

This concludes Tutorial 4.