Tutorial 8: Mixed-initiative annotation of a multi-step annotation task

In Tutorial 7, we learned how to do hand annotation for complex annotations. In this tutorial we'll learn how to do annotation for a task with multiple steps, where later steps may depend on the completion of prior steps. We'll also learn how you can take advantage of mixed-initiative annotation (tag-a-little, learn-a-little) in any or all of the steps. If you're not already familiar with tag-a-little, learn-a-little annotation, it would be helpful to review that page before beginning this tutorial.

As in Tutorial 7, we're going to do this tutorial in file mode. We're going to assume that the sample task directory is installed (see step 1 in Tutorial 1 if it isn't).

Step 1: Start up the Web server and UI

See the section on starting the UI in Tutorial 1.

Step 2: Learn about the Sample Relations task

The "Sample Relations" task also builds on the sample "Named Entity" task used in tutorials 1 through 6. The "Named Entity" task includes the following entity labels:

PERSON: the name of a person (e.g., "Hilary Clinton")
LOCATION: the name of a politically or geographically defined location (e.g., "Poland", "Asia") -- adjectival forms are not tagged (see NATIONALITY below)
ORGANIZATION: the name of a corporate, governmental, or other organizational entity (e.g., "Yankees", "Microsoft", "Department of Justice")

When you installed the "sample/ne" task folder, it also included the "Enhanced Named Entity" task used in tutorial 7, and the "Sample Relations" task that we will use in this tutorial. The "Sample Relations" task adds one additional entity:

NATIONALITY: the name (singular or plural) of a nationality (e.g., "American", "North Koreans")

It also adds two spanless relations (Tutorial 7 covered creation and manipulation of spanless annotations):

Employment: Captures the relationship between PERSONs and their employers (which may be an ORGANIZATION, or a LOCATION or NATIONALITY)
Located: Captures the (past, present or future) physical LOCATION of a PERSON, or the relationship between an ORGANIZATION and the LOCATION or NATIONALITY where it is located, based, or does business

We are going to be working with the "Mixed Initiative Annotation" workflow, which defines three sequential mixed-initiative tag steps:

entity_tag
nationality_tag
relation_tag

The following excerpt from the task definition shows how these steps are defined and incorporated into the "Mixed Initiative Annotation" workflow:

    <steps>
      <annotation_step engine='carafe_tag_engine' sets_added='entities'
                       type='mixed' name='entity_tag'/>
      <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                       type='mixed' name='nationality_tag'/>
      <annotation_step engine='trivial_relation_tag_engine' sets_added='relations'
                       type='mixed' name='relation_tag'/>
	
	...

    </steps>

    <workflows>
      <workflow name='Mixed Initiative Annotation' undoable="yes">
        <step pretty_name='zone' name='whole_zone'/>
        <step pretty_name='tokenize' name='carafe_tokenize'/>
        <step name='entity_tag' pretty_name='tag entities' type='mixed'/>
        <step name='nationality_tag' pretty_name='tag nationalities' type='mixed'/>
        <step name='relation_tag' pretty_name='tag relations' type='mixed'/>
      </workflow>

	...

    </workflows>

You can review the entire task definition in MAT_PKG_HOME/sample/ne/task.xml. There are three tasks defined in that file; the "Sample Relations" task is the last one.
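If you'd like to confirm which steps are declared as mixed-initiative without reading the whole file, a short shell function can pull them out. This is only a sketch: list_mixed_steps is a hypothetical helper name, not part of MAT, and it assumes the annotation_step elements look like the ones in the excerpt above (a name attribute and type='mixed', possibly split across lines).

```shell
# Hypothetical helper (not part of MAT): print the name of every
# <annotation_step> element declared with type='mixed' in a task file.
list_mixed_steps() {
  # Join lines first, because an element may span more than one line.
  tr '\n' ' ' < "$1" |
    grep -o '<annotation_step[^>]*>' |
    grep "type=['\"]mixed['\"]" |
    sed -E "s/.*[[:space:]]name=['\"]([^'\"]*)['\"].*/\1/"
}

# Example call, assuming the sample task is installed:
# list_mixed_steps "$MAT_PKG_HOME/sample/ne/task.xml"
```

Because task.xml defines three tasks, the output may include steps that belong to the other tasks as well.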

MAT comes with the jCarafe tagging engine. We have also added a "trivial" relation tagger to MAT_PKG_HOME/sample/ne/resources/python, primarily so that we can demonstrate mixed-initiative annotation for relation tagging steps. It is not an accurate tagger, and you would not want to use it for real work. It is possible to use other engines you supply in your tasks, although the details of how to do that are currently not documented.

Step 3: Create your working directory

We're going to assume a scenario in which we have ten documents (voa1 through voa10) tagged with named entity data, and ten more (voa11 through voa20) with no annotations, and we want to end up with all 20 documents tagged with all of the types of annotations the Sample Relations task supports.

If you want to work in file mode, you should create a working directory somewhere convenient and copy into it the files MAT_PKG_HOME/sample/ne/resources/data/json/voa1.txt.json through voa10.txt.json, as well as MAT_PKG_HOME/sample/ne/resources/data/raw/voa11.txt through voa20.txt.
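On Unix, the copying described above can be scripted. The following is a minimal sketch: setup_workdir is a hypothetical helper name, and it assumes the sample files sit at exactly the paths given in the previous paragraph.

```shell
# Hypothetical helper (not part of MAT): populate a working directory with
# the ten tagged documents and the ten raw documents used in this tutorial.
#   $1 = MAT installation directory (MAT_PKG_HOME)
#   $2 = working directory to create or fill
setup_workdir() {
  pkg_home=$1
  workdir=$2
  mkdir -p "$workdir" || return 1
  i=1
  # voa1 through voa10: documents already tagged with named entities.
  while [ "$i" -le 10 ]; do
    cp "$pkg_home/sample/ne/resources/data/json/voa$i.txt.json" "$workdir/" || return 1
    i=$((i + 1))
  done
  # voa11 through voa20: raw documents with no annotations.
  while [ "$i" -le 20 ]; do
    cp "$pkg_home/sample/ne/resources/data/raw/voa$i.txt" "$workdir/" || return 1
    i=$((i + 1))
  done
}

# Example call:
# setup_workdir "$MAT_PKG_HOME" YOUR_WORKING_DIRECTORY
```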

Step 4: Build a model for the first step (tag entities)

We're going to start by using mixed-initiative annotation to add PERSON, ORGANIZATION and LOCATION tags to the ten unannotated documents.

First we'll build a model for the "entity_tag" step based on the ten documents we already have tagged, and use that model to pre-tag the remaining documents.

We learned to build a model for a simple task in tutorial 2. Since our current task has multiple trainable steps, we must also specify the step for which we want to train a model. Here we'll ask the model builder to build a model for the "Sample Relations" task's "entity_tag" step. In a shell:

Unix:

$ cd $MAT_PKG_HOME
$ bin/MATModelBuilder --task 'Sample Relations' --step 'entity_tag' \
--save_as_default_model --input_files "YOUR_WORKING_DIRECTORY/*.json"

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATModelBuilder.cmd --task "Sample Relations" --step "entity_tag" ^
--save_as_default_model --input_files "YOUR_WORKING_DIRECTORY\*.json"

Step 5: Autotag and hand-correct entity tags in the UI

Let's go ahead and load a document.

In the UI, select File -> Open file... . You'll see a popup window.
Set the task dropdown menu to "Sample Relations".
Leave the language dropdown set to "English".
Select "Mixed Initiative Annotation" as your workflow.
Press the "Browse" button next to "Input", and navigate to your working directory that you set up in step 3. Select voa11.txt.
Select "raw" as your document type; the encoding will automatically be switched to utf-8.
Press the "Open" button, which should be active as soon as you select the input file.

Your window will look like this:

In the "Controls" panel on the right, you will see that this file has been opened for the "Sample Relations" task in the "Mixed Initiative Annotation" workflow. Below the workflow is a button with a gear and a triangle facing toward the right. If you hover over that button it will tell you that clicking it will "Apply zone". Do that to apply the required "zone" step to the document. A new button with a left-facing triangle will appear, which would allow you to undo the zone step. Hovering over the gear button will now tell you that clicking it will "Apply tokenize". Do it. Now the "Controls" panel should look like this:

Hand annotation is now available in the text pane. If you wanted, you could annotate the PERSON, ORGANIZATION and LOCATION entities entirely by hand. (Note that you cannot add any other type of annotation at this point -- your annotation is constrained by your current step.) But first, let's autotag the entities using the default model that we just created. Hovering over the gear button tells us that clicking it will "Apply tag entities". Let's do that. Now the document contains jCarafe's proposed entity tags, which you should correct by hand in the UI using the hand annotation skills you learned in tutorial 1. After you are happy with the annotations, use the button with the writing hand on it to indicate that you are finished with hand annotation for this step. After that you should save the file (as in step 6 of tutorial 1) and close it. (Although you could go on to do nationality tagging in that file, we're going to do entity tagging for all the documents first before moving on to nationality tagging. That way we can use the mixed-initiative loop for each step.)

At any point after correcting some of the autotagged files, you can build a new model, which will take into account your corrections and hopefully do a little better job of autotagging the remaining documents. Make sure that you don't have any files open in the UI when you create your model. Once you have finished autotagging and correcting all of the documents, we will move on to the nationality tagging step. (If you want to take a shortcut, you can copy the already entity-tagged sample copies of the remaining documents from MAT_PKG_HOME/sample/ne/resources/data/json/ to your working directory.)
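To keep track of which documents still need entity tagging, you could compare the raw files against the saved ones. The sketch below is hypothetical (list_untagged is not a MAT tool) and rests on one assumption: that you save each corrected document next to its raw file under the same name plus ".json", matching the voaN.txt.json naming of the sample copies.

```shell
# Hypothetical helper (not part of MAT): list raw documents in a working
# directory that have no saved counterpart yet. Assumes that voaN.txt is
# saved as voaN.txt.json in the same directory.
list_untagged() {
  for raw in "$1"/*.txt; do
    [ -e "$raw" ] || continue            # the glob matched nothing
    [ -e "$raw.json" ] || basename "$raw"
  done
}

# Example call:
# list_untagged YOUR_WORKING_DIRECTORY
```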

Step 6: Add the nationality tags

Now we're going to do for the nationality tags almost the same thing we did for the entity tags. But because we don't have any documents that are already tagged with nationalities, we'll have to tag a few by hand before we can build a model.

Use File -> Open file... to select voa1.txt.json from your working directory. Be sure to set the document type to mat-json. MAT should remember your selections for task, language and workflow, so you should not need to set those again. Since the entity tagging step is done in this document (and all the others), you will be in hand annotation mode for the nationality step. If we had a model for nationality tagging we could apply it at this time, but since we don't, we'll proceed with hand annotation. You will only be able to add NATIONALITY tags. Find the instances of nationalities, such as "North Korean", and tag them. As a shortcut, once you've added one instance of a nationality such as "North Korean", you can click on that annotation and choose "Autotag matches" from the context menu. A popup will tell you how many matches it tagged, if any. For "North Korean" it will autotag 5 matches. (Be careful with substrings -- you would not want to autotag "Korean" until you have tagged all instances of "North Korean" and "South Korean".) When you've finished tagging the nationalities in this file, use the hand button to indicate that you are done with hand annotation, then save and close the file. Repeat with voa2 through voa5. Be sure to tell MAT that you are done annotating nationalities in each file by using the hand button before saving the file.
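The substring caution can be seen with a plain text search. The sketch below does not reproduce how "Autotag matches" works inside MAT; it only counts pattern matches in a made-up sentence (count_matches and sample are illustrative names) to show why the longer names should be tagged first.

```shell
# Count non-overlapping matches of an extended regular expression ($1)
# in a string ($2).
count_matches() {
  printf '%s\n' "$2" | grep -oE -- "$1" | wc -l | tr -d ' '
}

# Made-up sentence for illustration only.
sample='North Korean officials met South Korean and Korean delegates.'

count_matches 'Korean' "$sample"                 # prints 3
count_matches '(North|South) Korean' "$sample"   # prints 2
```

Only one of the three occurrences of "Korean" stands alone; the other two are part of longer nationality names, so a search for the bare word would sweep them up if they had not been tagged already.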

Now build a default model for the "nationality_tag" step, using the files voa1 through voa5 as input. Again, remember that you should not have any files open in the UI when you create your model. In a shell:

Unix:

$ cd $MAT_PKG_HOME
$ bin/MATModelBuilder --task 'Sample Relations' --step 'nationality_tag' \
--save_as_default_model --input_files "YOUR_WORKING_DIRECTORY/voa[1-5].txt.json"

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATModelBuilder.cmd --task "Sample Relations" --step "nationality_tag" ^
--save_as_default_model --input_files "YOUR_WORKING_DIRECTORY\voa[1-5].txt.json"

Now as you open the remaining files for the nationality tagging step, you can apply the automatic tags first and then correct them, as we did for the entity tags. Since there are very few nationality tag examples in the text, the model won't have learned much, and you'll still have to do a lot of the work by hand. You can improve the model as you go along by re-building after you have corrected some more files, but even with 20 files, it's never going to do a great job of nationality tagging.
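Since you will rebuild models several times, for different steps, it may be convenient to wrap the Unix command shown above in a small function. This is a sketch, not a MAT feature: rebuild_model and the MAT_MODEL_BUILDER override are hypothetical names, and the function passes only the options already used in this tutorial.

```shell
# Hypothetical wrapper (not part of MAT) around the model-building command.
#   $1 = step name, e.g. entity_tag, nationality_tag or relation_tag
#   $2 = quoted file pattern for the training documents
# MAT_MODEL_BUILDER is an illustrative override, not a MAT setting; by
# default the function runs bin/MATModelBuilder under $MAT_PKG_HOME.
rebuild_model() {
  step=$1
  input_pattern=$2
  "${MAT_MODEL_BUILDER:-$MAT_PKG_HOME/bin/MATModelBuilder}" \
    --task 'Sample Relations' --step "$step" \
    --save_as_default_model --input_files "$input_pattern"
}

# Example call (close all files in the UI first):
# rebuild_model nationality_tag "YOUR_WORKING_DIRECTORY/voa[1-5].txt.json"
```

Keep the pattern quoted so that it reaches the model builder unexpanded, as in the commands above.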

Step 7: Add the relation tags

And one more time for the relation tags. At this point you would begin by hand-annotating some example files with the relations. We can cheat by copying over some gold standard documents with all three sets of annotations from MAT_PKG_HOME/sample/ne/resources/data/json_with_relations/. Create a model as above, replacing the step with "relation_tag". In this instance we are training the trivial relation tagger rather than jCarafe, as specified in task.xml. Again, you should proceed by opening each file, running the automatic tag step and doing the corrections by hand, then marking hand annotation done and saving and closing the file.

Step 8: Shut down the Web server

Shut down your Web server by typing "exit" in the window where you started the Web server. More details here.

Step 9: Clean up (optional)

If you're not planning on doing any other tutorials, and you don't want the sample tasks hanging around, remove them as follows:

Unix:

$ cd $MAT_PKG_HOME
$ bin/MATManagePluginDirs remove $PWD/sample/ne

Windows native:

> cd %MAT_PKG_HOME%
> bin\MATManagePluginDirs.cmd remove %CD%\sample\ne

This concludes Tutorial 8.