The sample tasks

The sample tasks can be found in MAT_PKG_HOME/sample/ne. This directory, like all task directories, has a file named task.xml at its root. The format of this file is described in the task XML documentation. The directory has a "python" subdirectory, which implements a Python engine for the "Sample Relations" task below, but no JavaScript customizations (so no "js" subdirectory). See "Creating a task" for a description of the subdirectory structure of the task.

This task.xml file contains three tasks: "Named Entity", "Enhanced Named Entity" and "Sample Relations". The first task is a simple span task; it contains spanned annotations without any complex attribute structure. This task is used for Tutorials 1 through 6, and for a variety of other examples throughout this documentation. The second task is a complex task, containing both spanned and spanless annotations and multiple attributes, some of which take other annotations as their values. This second task is used for Tutorial 7, as well as the UI documentation on editing annotations and spanless annotations. The final task is a complex task, containing spanned and spanless annotations, which is intended to illustrate how multiple content annotation sets and multiple engines can be used. This third task is used for Tutorial 8.

In the sample tasks, we'll illustrate both the new method of defining annotations and their attributes and the legacy method.

The "Named Entity" task

The file typically contains a single task declaration, with <task> as the top-level element. However, if you wish to declare multiple tasks in the same task.xml file, it can instead contain multiple <task> elements within a <tasks> element. Here, we will define three tasks. Each task must be named and must declare its supported languages:

    <tasks>
      <task name='Named Entity'>
        <languages>
          <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
        </languages>
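
As an illustration of the single-task versus multi-task layout, here is a minimal sketch that lists the task names and language codes in a task.xml document. It assumes Python's standard xml.etree.ElementTree module rather than MAT's own task loader, and the function name list_tasks is hypothetical. The XML is an abbreviated copy of the declarations above, with closing tags added so that it parses.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample declarations, closed so that it parses.
TASK_XML = """
<tasks>
  <task name='Named Entity'>
    <languages>
      <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
    </languages>
  </task>
  <task name='Enhanced Named Entity'>
    <languages>
      <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
    </languages>
  </task>
</tasks>
"""


def list_tasks(xml_text):
    """Return {task name: [language codes]} for a <task> or <tasks> document.

    Hypothetical helper for illustration; not part of MAT.
    """
    root = ET.fromstring(xml_text)
    # The file holds either one top-level <task> or several inside <tasks>.
    tasks = [root] if root.tag == "task" else root.findall("task")
    return {
        task.get("name"): [
            lang.get("code") for lang in task.findall("languages/language")
        ]
        for task in tasks
    }


task_languages = list_tasks(TASK_XML)
print(task_languages)
```

The same helper accepts a file whose top-level element is a single <task>, since it checks the root tag before looking for children.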

Each task usually contains a block of annotation declarations:

        <annotations all_annotations_known='no'
                     inherit='category:zone,category:token'>
          <span label='PERSON'
                d_css='background-color: #CCFF66' d_accelerator='P'/>
          <span label='LOCATION'
                d_css='background-color: #FF99CC' d_accelerator='L'/>
          <span label='ORGANIZATION'
                d_css='background-color: #99CCFF' d_accelerator='O'/>
        </annotations>

Here, we have inherited the zone and token category tags from the root task, and defined our own content tags, PERSON, LOCATION and ORGANIZATION. We also define the display properties of these tags. For instance, the PERSON tag will display as light green (defined here in hexadecimal), and the tagging menu will support the "P" keyboard accelerator for annotating a selected span with the PERSON tag.
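
To make the relationship between labels and display properties concrete, here is a small sketch that reads the block above into a lookup table. It assumes only Python's standard xml.etree.ElementTree module; display_table is a hypothetical helper, not a MAT function, and MAT itself may process these attributes differently.

```python
import xml.etree.ElementTree as ET

ANNOTATIONS_XML = """
<annotations all_annotations_known='no'
             inherit='category:zone,category:token'>
  <span label='PERSON'
        d_css='background-color: #CCFF66' d_accelerator='P'/>
  <span label='LOCATION'
        d_css='background-color: #FF99CC' d_accelerator='L'/>
  <span label='ORGANIZATION'
        d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
"""


def display_table(xml_text):
    """Return (inherited sets, {label: display properties}).

    Hypothetical helper for illustration; not part of MAT.
    """
    root = ET.fromstring(xml_text)
    # The inherit attribute is a comma-separated list of sets or categories.
    inherited = root.get("inherit", "").split(",")
    labels = {
        span.get("label"): {
            "css": span.get("d_css"),
            "accelerator": span.get("d_accelerator"),
        }
        for span in root.findall("span")
    }
    return inherited, labels


inherited, labels = display_table(ANNOTATIONS_XML)
for label, props in labels.items():
    print(label, props["accelerator"], props["css"])
```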

For reference, here are the same annotation declarations and their display attributes, defined by the legacy method:

      	    <annotation_set_descriptors all_annotations_known='no'
<annotation_set_descriptor category='content' name='content'>
<annotation label='PERSON'/>
<annotation label='LOCATION'/>
<annotation label='ORGANIZATION'/>
<label css='background-color: #CCFF66' name='PERSON' accelerator='P'/>
<label css='background-color: #FF99CC' name='LOCATION' accelerator='L'/>
<label css='background-color: #99CCFF' name='ORGANIZATION' accelerator='O'/>

If the task supports automated annotation (trainable or otherwise), it will define engines:

        <engines>
          <engine name='carafe_tag_engine'>
            <default_model>default_model</default_model>
            <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
              <build_settings training_method='psa' max_iterations='6'/>
            </model_config>
            <model_config config_name='alt_model_build'
                          class='MAT.JavaCarafe.CarafeModelBuilder'/>
            <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
          </engine>
          <engine name='align_engine'>
            <step_config class='MAT.PluginMgr.AlignStep'/>
          </engine>
          <engine name='whole_zone_engine'>
            <step_config class='MAT.PluginMgr.WholeZoneStep'/>
          </engine>
          <engine name='carafe_tokenize_engine'>
            <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
          </engine>
        </engines>

Each engine is named, and specifies a class, using the <step_config> element, which implements the automated annotation. If the engine is trainable, it will also have at least one <model_config> element (as we see for "carafe_tag_engine" above), which can support customizations for the model build settings. If it is trainable, it also optionally defines a default model file name, which is suffixed with the relevant language and step when it's referenced.
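
The distinction between trainable and non-trainable engines can be read off the declarations mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and treating the presence of a <model_config> element as the mark of a trainable engine, as described above. The helper summarize_engines is hypothetical, and the XML is an abbreviated copy of the engine block.

```python
import xml.etree.ElementTree as ET

ENGINES_XML = """
<engines>
  <engine name='carafe_tag_engine'>
    <default_model>default_model</default_model>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
      <build_settings training_method='psa' max_iterations='6'/>
    </model_config>
    <model_config config_name='alt_model_build'
                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='align_engine'>
    <step_config class='MAT.PluginMgr.AlignStep'/>
  </engine>
</engines>
"""


def summarize_engines(xml_text):
    """Return {engine name: summary dict}.

    Hypothetical helper for illustration; not part of MAT.
    """
    summary = {}
    for engine in ET.fromstring(xml_text).findall("engine"):
        model_configs = engine.findall("model_config")
        summary[engine.get("name")] = {
            "step_class": engine.find("step_config").get("class"),
            # An engine with at least one <model_config> is trainable.
            "trainable": bool(model_configs),
            # The unnamed <model_config> shows up as None here.
            "config_names": [m.get("config_name") for m in model_configs],
            "default_model": engine.findtext("default_model"),
        }
    return summary


engines = summarize_engines(ENGINES_XML)
for name, info in engines.items():
    print(name, "trainable" if info["trainable"] else "not trainable")
```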

If a task supports any hand or automated annotation activity, it defines steps, which are the basic building blocks of the annotation activities:

        <steps>
          <annotation_step engine='align_engine' type='auto' name='align'/>
          <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                           type='mixed' name='carafe_tag'/>
          <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                           type='auto' name='whole_zone'/>
          <annotation_step engine='carafe_tokenize_engine'
                           sets_added='category:token' type='auto'
                           name='carafe_tokenize'/>
          <annotation_step type='hand' name='correct'
                           sets_modified='category:content'/>
        </steps>

The most common step is the annotation step, which comes in four subtypes.

This task has three auto steps (whole_zone, align, carafe_tokenize), a mixed step (carafe_tag), and a hand step (correct). These steps specify which annotations they modify or add, and optionally connect those annotation sets or categories with the engine which applies them.
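
The grouping of steps by type described above can be reproduced from the declarations themselves. This is a small sketch, assuming Python's standard xml.etree.ElementTree module; steps_by_type is a hypothetical helper rather than anything MAT provides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

STEPS_XML = """
<steps>
  <annotation_step engine='align_engine' type='auto' name='align'/>
  <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                   type='mixed' name='carafe_tag'/>
  <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                   type='auto' name='whole_zone'/>
  <annotation_step engine='carafe_tokenize_engine'
                   sets_added='category:token' type='auto'
                   name='carafe_tokenize'/>
  <annotation_step type='hand' name='correct'
                   sets_modified='category:content'/>
</steps>
"""


def steps_by_type(xml_text):
    """Return {step type: [step names in declaration order]}.

    Hypothetical helper for illustration; not part of MAT.
    """
    groups = defaultdict(list)
    for step in ET.fromstring(xml_text).findall("annotation_step"):
        groups[step.get("type")].append(step.get("name"))
    return dict(groups)


step_groups = steps_by_type(STEPS_XML)
print(step_groups)
```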

These steps can be assembled into workflows:

        <workflows>
          <workflow name='Tokenless hand annotation'>
            <step pretty_name='zone' name='whole_zone'/>
            <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
          </workflow>
          <workflow name='Review/repair'>
            <step name='correct'/>
          </workflow>
          <workflow name='Demo' undoable="yes">
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step pretty_name='tag' name='carafe_tag'/>
          </workflow>
          <workflow name='Align'>
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step name='align'/>
          </workflow>
        </workflows>

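Here's what these workflows do, reading directly from the declarations above. "Tokenless hand annotation" zones the document with whole_zone and then runs the carafe_tag step, declared here with type='hand'; it has no tokenization step. "Review/repair" consists of the single hand step, correct. "Demo" zones, tokenizes, and then runs the mixed carafe_tag step, and is marked undoable. "Align" zones, tokenizes, and then runs the align step.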
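
One way to inspect the workflows above is to list each workflow's steps in order and check that every step it names has been declared. The following is a minimal sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree module, the helper names workflow_steps and undeclared_steps are hypothetical, and the set of declared step names is copied by hand from the <steps> block rather than read from a real task.xml file.

```python
import xml.etree.ElementTree as ET

WORKFLOWS_XML = """
<workflows>
  <workflow name='Tokenless hand annotation'>
    <step pretty_name='zone' name='whole_zone'/>
    <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
  </workflow>
  <workflow name='Review/repair'>
    <step name='correct'/>
  </workflow>
  <workflow name='Demo' undoable="yes">
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step pretty_name='tag' name='carafe_tag'/>
  </workflow>
  <workflow name='Align'>
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step name='align'/>
  </workflow>
</workflows>
"""

# Step names copied by hand from the <steps> block of the same task.
DECLARED_STEPS = {"align", "carafe_tag", "whole_zone", "carafe_tokenize", "correct"}


def workflow_steps(xml_text):
    """Return {workflow name: [step names in order]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        wf.get("name"): [step.get("name") for step in wf.findall("step")]
        for wf in root.findall("workflow")
    }


def undeclared_steps(workflows, declared):
    """Return {workflow name: [step names missing from declared]}."""
    problems = {}
    for name, steps in workflows.items():
        missing = [s for s in steps if s not in declared]
        if missing:
            problems[name] = missing
    return problems


workflows = workflow_steps(WORKFLOWS_XML)
for name, steps in workflows.items():
    print(name, "->", ", ".join(steps))
print("undeclared:", undeclared_steps(workflows, DECLARED_STEPS))
```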

Once the workflows are defined, you can (optionally) specify the properties of workspaces. By default, you don't need to say anything additional about workspaces, since every human-annotatable workflow can serve as the basis of a workspace, but you might want to declare your default configuration (or workflow), and set special properties of workspace operations:

        <workspaces default_config="Demo">
          <workspace workflow='Demo'/>
        </workspaces>

Here, we define a default configuration, and set up a block which can (but currently does not) customize the behavior of the "Demo" workflow in the context of the workspaces.

The "Enhanced Named Entity" task

At this point, we end the first task and begin the second one.

      </task>
      <task name='Enhanced Named Entity'>
        <languages>
          <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
        </languages>

And now, we define the annotations and their displays in the second task:

        <annotations all_annotations_known='no'
                     inherit='category:zone,category:token'>
          <span label='PERSON'
                d_css='background-color: #CCFF66' d_accelerator='P'
                d_edit_immediately='yes'>
            <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
          </span>
          <span label='LOCATION'
                d_css='background-color: #FF99CC' d_accelerator='L'
                d_edit_immediately='yes'>
            <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
            <boolean name='is_political_entity'/>
          </span>
          <span label='ORGANIZATION'
                d_css='background-color: #99CCFF' d_accelerator='O'
                d_edit_immediately='yes'>
            <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
          </span>
          <spanless label='PERSON_COREF'
                    d_css='background-color: lightgreen' d_accelerator='C'>
            <filler_set name='mentions' filler_types='PERSON'/>
          </spanless>
          <span label='LOCATED_EVENT'
                d_css='background-color: pink' d_accelerator='E'
                d_edit_immediately='yes'>
            <filler name='actor' filler_types='PERSON'/>
            <filler name='location' filler_types='LOCATION,ORGANIZATION'/>
          </span>
          <spanless label='LOCATION_RELATION'
                    d_css='background-color: orange' d_accelerator='R'>
            <filler name='located' filler_types='ORGANIZATION,PERSON'/>
            <filler name='location' filler_types='LOCATION'/>
          </spanless>
        </annotations>

This annotation definition block is much more complex than the one in the "Named Entity" task. In addition to the three labels we saw previously, we also have three other labels: "LOCATED_EVENT" (spanned) and "PERSON_COREF" and "LOCATION_RELATION" (spanless). We also have several attributes, of different types. Most notable is the "mentions" attribute of the "PERSON_COREF" annotation, which takes sets of annotations as its value. The annotation display information is also somewhat more complex; we see here that all of the annotations are marked to be edited immediately upon creation.
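
To see the attribute structure at a glance, the declarations can be flattened into a simple schema table. This is a sketch only, assuming Python's standard xml.etree.ElementTree module; attribute_schema is a hypothetical helper, the XML is an abbreviated copy of the block above with the display attributes omitted, and the element tag (string, boolean, filler, filler_set) is simply reported as the attribute's type.

```python
import xml.etree.ElementTree as ET

ENHANCED_XML = """
<annotations all_annotations_known='no'
             inherit='category:zone,category:token'>
  <span label='LOCATION'>
    <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
    <boolean name='is_political_entity'/>
  </span>
  <spanless label='PERSON_COREF'>
    <filler_set name='mentions' filler_types='PERSON'/>
  </spanless>
  <span label='LOCATED_EVENT'>
    <filler name='actor' filler_types='PERSON'/>
    <filler name='location' filler_types='LOCATION,ORGANIZATION'/>
  </span>
</annotations>
"""


def attribute_schema(xml_text):
    """Return {label: {"spanned": bool, "attributes": {...}}}.

    Hypothetical helper for illustration; not part of MAT.
    """
    schema = {}
    for ann in ET.fromstring(xml_text):
        attributes = {}
        for attr in ann:
            entry = {"type": attr.tag}
            # String attributes may restrict their values to a choice list.
            if attr.get("choices"):
                entry["choices"] = attr.get("choices").split(",")
            # Annotation-valued attributes list the labels they accept.
            if attr.get("filler_types"):
                entry["filler_types"] = attr.get("filler_types").split(",")
            attributes[attr.get("name")] = entry
        schema[ann.get("label")] = {
            "spanned": ann.tag == "span",
            "attributes": attributes,
        }
    return schema


schema = attribute_schema(ENHANCED_XML)
for label, info in schema.items():
    kind = "spanned" if info["spanned"] else "spanless"
    print(label, kind, sorted(info["attributes"]))
```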

For reference, here are the same annotation declarations and their display attributes, defined by the legacy method:

    	    <annotation_set_descriptors all_annotations_known='no'
<annotation_set_descriptor category='content' name='content'>
<annotation label='PERSON'/>
<annotation label='LOCATION'/>
<annotation label='ORGANIZATION'/>
<attribute name='nomtype' of_annotation='PERSON,LOCATION,ORGANIZATION'>
<choice>Proper name</choice>
<attribute name='is_political_entity' type='boolean'
<annotation label='LOCATED_EVENT'/>
<attribute name='actor' type='annotation' of_annotation='LOCATED_EVENT'>
<label_restriction label='PERSON'/>
<attribute name='location' type='annotation'
<label_restriction label='LOCATION'/>
<label_restriction label='ORGANIZATION'/>
<annotation span='no' label='PERSON_COREF'/>
<attribute name='mentions' aggregation='set' type='annotation'
<label_restriction label='PERSON'/>
<annotation span='no' label='LOCATION_RELATION'/>
<attribute name='located' type='annotation'
<label_restriction label='ORGANIZATION'/>
<label_restriction label='PERSON'/>
<attribute name='location' type='annotation'
<label_restriction label='LOCATION'/>
<label css='background-color: #CCFF66' name='PERSON' accelerator='P'
<label css='background-color: #FF99CC' name='LOCATION' accelerator='L'
<label css='background-color: #99CCFF' name='ORGANIZATION' accelerator='O'
<label css='background-color: lightgreen' name='PERSON_COREF'
accelerator='C' edit_immediately='yes'/>
<label css='background-color: pink' name='LOCATED_EVENT' accelerator='E'
<label css='background-color: orange' name='LOCATION_RELATION'
accelerator='R' edit_immediately='yes'/>

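The remainder of this task is essentially identical to the "Named Entity" task. In the declarations shown here, the only differences are the default model name ("default_enhanced_model") and the absence of a default_config attribute on the <workspaces> element: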

        <engines>
          <engine name='carafe_tag_engine'>
            <default_model>default_enhanced_model</default_model>
            <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
              <build_settings training_method='psa' max_iterations='6'/>
            </model_config>
            <model_config config_name='alt_model_build'
                          class='MAT.JavaCarafe.CarafeModelBuilder'/>
            <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
          </engine>
          <engine name='align_engine'>
            <step_config class='MAT.PluginMgr.AlignStep'/>
          </engine>
          <engine name='whole_zone_engine'>
            <step_config class='MAT.PluginMgr.WholeZoneStep'/>
          </engine>
          <engine name='carafe_tokenize_engine'>
            <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
          </engine>
        </engines>
        <steps>
          <annotation_step engine='align_engine' type='auto' name='align'/>
          <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                           type='mixed' name='carafe_tag'/>
          <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                           type='auto' name='whole_zone'/>
          <annotation_step engine='carafe_tokenize_engine'
                           sets_added='category:token' type='auto'
                           name='carafe_tokenize'/>
          <annotation_step type='hand' name='correct'
                           sets_modified='category:content'/>
        </steps>
        <workflows>
          <workflow name='Tokenless hand annotation'>
            <step pretty_name='zone' name='whole_zone'/>
            <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
          </workflow>
          <workflow name='Review/repair'>
            <step name='correct'/>
          </workflow>
          <workflow name='Demo' undoable="yes">
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step pretty_name='tag' name='carafe_tag'/>
          </workflow>
          <workflow name='Align'>
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step name='align'/>
          </workflow>
        </workflows>
        <workspaces>
          <workspace workflow='Demo'/>
        </workspaces>

Notably, because the jCarafe tagger only operates on the simple span subset of this (or any) task, the "Demo" workflow will only apply the spanned labels, not the attributes associated with them, and won't apply the spanless labels at all.

The "Sample Relations" task

At this point, we end the second task and begin the third:

      </task>
      <task name='Sample Relations'>
        <languages>
          <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
        </languages>

This third task is intended to illustrate the impact of multiple content annotation sets: the ability to reuse tagging engines, to segregate annotation activities by annotation set, and to support multiple mixed-initiative steps in the same workflow. As part of this illustration, we've implemented an extremely simplistic two-argument trainable relation tagger, which essentially does classification of the bags of words in between successive pairs of candidate relations. We're not advertising this as a relation tagging capability for anything besides demonstrating how trainable relation tagging might be integrated. Here are the annotation sets and their displays:

        <annotations all_annotations_known='no'
                     inherit='category:zone,category:token'>
          <span label='PERSON' of_set='entities'
                d_css='background-color: LawnGreen' d_accelerator='P'
                d_edit_immediately='yes'/>
          <span label='LOCATION' of_set='entities'
                d_css='background-color: HotPink' d_accelerator='L'
                d_edit_immediately='yes'/>
          <span label='ORGANIZATION' of_set='entities'
                d_css='background-color: DeepSkyBlue' d_accelerator='O'
                d_edit_immediately='yes'/>
          <span label='NATIONALITY' of_set='nationality'
                d_css='background-color: PaleVioletRed' d_accelerator='N'
                d_edit_immediately='yes'/>
          <spanless label="Employment" of_set='relations'
                    d_css="background-color: Gray">
            <filler name="Employee" filler_types="PERSON"/>
            <filler name="Employer" filler_types="ORGANIZATION,LOCATION,NATIONALITY"/>
          </spanless>
          <spanless label="Located" of_set='relations'
                    d_css="background-color: Thistle">
            <filler name="Located-Entity" filler_types="PERSON,ORGANIZATION"/>
            <filler name="Location" filler_types="LOCATION,NATIONALITY"/>
          </spanless>
        </annotations>

Notice that there are three annotation sets rather than one ("entities", "nationality" and "relations"); while they're all in category "content", each has a different set name.
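
The division of labels into annotation sets can be recovered from the of_set attributes. Here is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; labels_by_set is a hypothetical helper, the XML is abbreviated to the label and of_set attributes, and the fallback set name "content" for annotations without of_set is an assumption made for this example, not something the declarations above confirm.

```python
import xml.etree.ElementTree as ET

RELATIONS_XML = """
<annotations all_annotations_known='no'
             inherit='category:zone,category:token'>
  <span label='PERSON' of_set='entities'/>
  <span label='LOCATION' of_set='entities'/>
  <span label='ORGANIZATION' of_set='entities'/>
  <span label='NATIONALITY' of_set='nationality'/>
  <spanless label="Employment" of_set='relations'/>
  <spanless label="Located" of_set='relations'/>
</annotations>
"""


def labels_by_set(xml_text, default_set="content"):
    """Return {set name: [labels in declaration order]}.

    Hypothetical helper; default_set is an assumed fallback name.
    """
    sets = {}
    for ann in ET.fromstring(xml_text):
        set_name = ann.get("of_set", default_set)
        sets.setdefault(set_name, []).append(ann.get("label"))
    return sets


annotation_sets = labels_by_set(RELATIONS_XML)
for set_name, set_labels in annotation_sets.items():
    print(set_name, "->", ", ".join(set_labels))
```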

For reference, here are the same annotation declarations and their display attributes, defined by the legacy method:

   	    <annotation_set_descriptors all_annotations_known='no'
<annotation_set_descriptor category='content' name='entities'>
<annotation label='PERSON'/>
<annotation label='LOCATION'/>
<annotation label='ORGANIZATION'/>
<annotation_set_descriptor category='content' name='nationality'>
<annotation label='NATIONALITY'/>
<annotation_set_descriptor category='content' name='relations'>
<annotation label="Employment" span="no"/>
<attribute name="Employee" of_annotation="Employment" type="annotation">
<label_restriction label="PERSON"/>
<attribute name="Employer" of_annotation="Employment" type="annotation">
<label_restriction label="ORGANIZATION"/>
<label_restriction label='LOCATION'/>
<label_restriction label="NATIONALITY"/>
<annotation label="Located" span="no"/>
<attribute name="Located-Entity" of_annotation="Located" type="annotation">
<label_restriction label="PERSON"/>
<label_restriction label="ORGANIZATION"/>
<attribute name="Location" of_annotation="Located" type="annotation">
<label_restriction label="LOCATION"/>
<label_restriction label="NATIONALITY"/>
<label css='background-color: LawnGreen' name='PERSON' accelerator='P'
<label css='background-color: HotPink' name='LOCATION' accelerator='L'
<label css='background-color: DeepSkyBlue' name='ORGANIZATION' accelerator='O'
<label css='background-color: PaleVioletRed' name='NATIONALITY' accelerator='N'
<label name="Employment" css="background-color: Gray" edit_immediately="yes"/>
<label name="Located" css="background-color: Thistle" edit_immediately="yes"/>

The next element, which doesn't appear in the other two tasks, supports a range of Web UI customizations. In this case, we specify that we want the annotations to appear in the annotation menu and legend in the order they're defined, not in alphabetical order:

        <web_customization alphabetize_labels="no"/>

Next, we define the engines.

        <engines>
          <engine name='carafe_tag_engine'>
            <default_model>default_model</default_model>
            <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
              <build_settings training_method='psa' max_iterations='6'/>
            </model_config>
            <model_config config_name='alt_model_build'
                          class='MAT.JavaCarafe.CarafeModelBuilder'/>
            <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
          </engine>
          <engine name='trivial_relation_tag_engine'>
            <default_model>default_relation_model</default_model>
            <model_config class='TrivialRelationTagger.CarafeMaxentRelationModelBuilder'/>
            <step_config class='TrivialRelationTagger.CarafeRelationTagStep'/>
          </engine>
          <engine name='align_engine'>
            <step_config class='MAT.PluginMgr.AlignStep'/>
          </engine>
          <engine name='whole_zone_engine'>
            <step_config class='MAT.PluginMgr.WholeZoneStep'/>
          </engine>
          <engine name='carafe_tokenize_engine'>
            <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
          </engine>
        </engines>

In addition to the engines declared in the other two tasks, this task includes the "trivial_relation_tag_engine", which implements the simplistic relation tagging we described a moment ago.

The important differences arise in the definition of the steps:

        <steps>
          <annotation_step engine='align_engine' type='auto' name='align'/>
          <annotation_step engine='carafe_tag_engine' sets_added='entities'
                           type='mixed' name='entity_tag'/>
          <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                           type='mixed' name='nationality_tag'/>
          <annotation_step engine='carafe_tag_engine' sets_added='entities,nationality'
                           type='mixed' name='all_entity_tag'/>
          <annotation_step engine='trivial_relation_tag_engine' sets_added='relations'
                           type='mixed' name='relation_tag'/>
          <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                           type='auto' name='whole_zone'/>
          <annotation_step engine='carafe_tokenize_engine'
                           sets_added='category:token' type='auto'
                           name='carafe_tokenize'/>
          <annotation_step type='hand' name='correct'
                           sets_modified='category:content'/>
        </steps>

Here, we see that the "carafe_tag_engine" is used in three different steps: "entity_tag", "nationality_tag", and "all_entity_tag". This last step adds two annotation sets, rather than one. When a trainable engine defines a default model, the model path is suffixed with both the language and the step name when it's referenced and used, so the models for these three steps will be kept separate. We also see that the "trivial_relation_tag_engine" is used in the "relation_tag" step. All four of these steps are mixed steps, so you'll be able to annotate the sets they add by hand de novo, or pretag and correct if a model is available.

Finally, we have the workflows and workspaces (and then we end the task and the file):

        <workflows>
          <workflow name='Mixed Initiative Annotation' undoable="yes">
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step name='entity_tag' pretty_name='tag entities'/>
            <step name='nationality_tag' pretty_name='tag nationalities'/>
            <step name='relation_tag' pretty_name='tag relations'/>
          </workflow>
          <workflow name='Review/repair all steps'>
            <step name='correct'/>
          </workflow>
          <workflow name='Demo'>
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step pretty_name='tag all entities' name='all_entity_tag'/>
            <step pretty_name='tag relations' name='relation_tag'/>
          </workflow>
          <workflow name='Align'>
            <step pretty_name='zone' name='whole_zone'/>
            <step pretty_name='tokenize' name='carafe_tokenize'/>
            <step name='align'/>
          </workflow>
        </workflows>
        <workspaces default_config="Mixed Initiative Annotation">
          <workspace workflow='Mixed Initiative Annotation'/>
        </workspaces>
      </task>
    </tasks>

The "Mixed Initiative Annotation" workflow illustrates how multi-step mixed initiative works: it contains several mixed steps, each responsible for a different content annotation set. The annotator will complete the entity annotations, and then the nationality annotations, and then the relation annotations. In each case, the annotator will have the option of pretagging; most significantly, the annotator will have the opportunity to correct the spans before pretagging the relations.
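
The order in which the annotator meets the annotation sets can be traced by joining the workflow's steps to their declarations. The following is a sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree module, trace_workflow is a hypothetical helper rather than part of MAT, and the XML is copied from the "Sample Relations" steps and workflow with the unused steps omitted.

```python
import xml.etree.ElementTree as ET

STEPS_XML = """
<steps>
  <annotation_step engine='carafe_tag_engine' sets_added='entities'
                   type='mixed' name='entity_tag'/>
  <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                   type='mixed' name='nationality_tag'/>
  <annotation_step engine='trivial_relation_tag_engine' sets_added='relations'
                   type='mixed' name='relation_tag'/>
  <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                   type='auto' name='whole_zone'/>
  <annotation_step engine='carafe_tokenize_engine'
                   sets_added='category:token' type='auto'
                   name='carafe_tokenize'/>
</steps>
"""

WORKFLOW_XML = """
<workflow name='Mixed Initiative Annotation' undoable="yes">
  <step pretty_name='zone' name='whole_zone'/>
  <step pretty_name='tokenize' name='carafe_tokenize'/>
  <step name='entity_tag' pretty_name='tag entities'/>
  <step name='nationality_tag' pretty_name='tag nationalities'/>
  <step name='relation_tag' pretty_name='tag relations'/>
</workflow>
"""


def trace_workflow(steps_xml, workflow_xml):
    """Return [(step name, engine, type, [sets added])] in workflow order.

    Hypothetical helper for illustration; not part of MAT.
    """
    declared = {
        step.get("name"): step
        for step in ET.fromstring(steps_xml).findall("annotation_step")
    }
    trace = []
    for ref in ET.fromstring(workflow_xml).findall("step"):
        decl = declared[ref.get("name")]
        sets_added = [s for s in (decl.get("sets_added") or "").split(",") if s]
        trace.append(
            (ref.get("name"), decl.get("engine"), decl.get("type"), sets_added)
        )
    return trace


trace = trace_workflow(STEPS_XML, WORKFLOW_XML)
for name, engine, step_type, sets_added in trace:
    print(f"{name}: {step_type} step, engine {engine}, adds {sets_added}")
```

In this trace the same engine, carafe_tag_engine, appears under two different step names, each adding a different annotation set, and the relation step comes only after both entity steps.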