The sample tasks can be found in MAT_PKG_HOME/sample. There are three task directories there; in this documentation, we discuss two of them. Both of these directories, like all task directories, have a file named task.xml at their root. The format of this file is described in the task XML documentation.
The first of these directories is ne. This directory has a python subdirectory that implements a Python engine for the "Sample Relations" task described below, but no JavaScript customizations (so no js subdirectory). See "Creating a task" for a description of the subdirectory structure of the task.
The task.xml file in the ne directory contains three
tasks: "Named Entity", "Enhanced Named Entity" and "Sample
Relations". The first task is a simple span task; it contains
spanned annotations without any complex attribute structure.
This task is used for Tutorials 1-6, and for a variety of other
examples throughout this documentation. The second task is a
complex task, containing both spanned and spanless annotations and
multiple attributes, some of which take other annotations as their
values. This second task is used for Tutorial
7, as well as the UI documentation on editing annotations and spanless annotations. The final task
is a complex task, containing spanned and spanless annotations,
which is intended to illustrate how multiple content annotation
sets and multiple engines can be used. This third task is used for
Tutorial 8.
The second of these directories is classification. The
task.xml file in this directory contains a single task, "Sample
Sentiment", which exemplifies using classification to assign
positive and negative sentiment to sentences.
In the sample tasks, we'll exemplify both the new method of
defining annotations and their attributes, and the legacy method.
The file typically contains a single task declaration, with <task> as the top-level element. However, if you wish to declare multiple tasks in the same task.xml file, it can also contain multiple <task> elements within a <tasks> element. Here, we will define three tasks. Each task must be named, and must declare its supported languages:
<tasks>
  <task name='Named Entity'>
    <languages>
      <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
    </languages>
Each task usually contains a block of annotation declarations:
<annotations all_annotations_known='no'
             inherit='category:zone,category:token'>
  <span label='PERSON'
        d_css='background-color: #CCFF66' d_accelerator='P'/>
  <span label='LOCATION'
        d_css='background-color: #FF99CC' d_accelerator='L'/>
  <span label='ORGANIZATION'
        d_css='background-color: #99CCFF' d_accelerator='O'/>
</annotations>
Here, we have inherited the zone and token category tags from the
root task, and defined our own content tags, PERSON, LOCATION and
ORGANIZATION. We also define the display properties of these tags.
For instance, the PERSON tag will display as light green (defined
here in hexadecimal), and the tagging menu will support the "P"
keyboard accelerator for annotating a selected span with the
PERSON tag.
For reference, here are the same annotation declarations and their
display attributes defined by the legacy method:
<annotation_set_descriptors all_annotations_known='no'
inherit='category:zone,category:token'>
<annotation_set_descriptor category='content' name='content'>
<annotation label='PERSON'/>
<annotation label='LOCATION'/>
<annotation label='ORGANIZATION'/>
</annotation_set_descriptor>
</annotation_set_descriptors>
<annotation_display>
<label css='background-color: #CCFF66' name='PERSON' accelerator='P'/>
<label css='background-color: #FF99CC' name='LOCATION' accelerator='L'/>
<label css='background-color: #99CCFF' name='ORGANIZATION' accelerator='O'/>
</annotation_display>
If the task supports automated annotation (trainable or
otherwise), it will define engines:
<engines>
  <engine name='carafe_tag_engine'>
    <default_model>default_model</default_model>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
      <build_settings training_method='psa' max_iterations='6'/>
    </model_config>
    <model_config config_name='alt_model_build'
                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='align_engine'>
    <step_config class='MAT.PluginMgr.AlignStep'/>
  </engine>
  <engine name='whole_zone_engine'>
    <step_config class='MAT.PluginMgr.WholeZoneStep'/>
  </engine>
  <engine name='carafe_tokenize_engine'>
    <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  </engine>
</engines>
Each engine is named, and uses the <step_config> element to specify
the class which implements the automated annotation. If the engine is
trainable, it will also have at least one <model_config> element (as
we see for "carafe_tag_engine" above), which can customize the model
build settings. A trainable engine may also define a default model
file name, which is suffixed with the relevant language and step when
it's referenced.
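The suffixing can be pictured with a small sketch (the naming scheme below is an assumption for illustration; MAT itself defines the actual convention):

```python
def model_path(default_model, language, step):
    """Illustrative only: a trainable engine's declared default model
    name is suffixed with the relevant language and step when the model
    is referenced. The exact separator here is an assumption."""
    return "{}_{}_{}".format(default_model, language, step)

# In a task where one engine serves several steps, each (language, step)
# pair resolves to its own model file.
model = model_path("default_model", "en", "carafe_tag")
```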
If a task supports any hand or automated annotation activity, it
defines steps, which are the basic building blocks of the
annotation activities:
<steps>
  <annotation_step engine='align_engine' type='auto' name='align'/>
  <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                   type='mixed' name='carafe_tag'/>
  <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                   type='auto' name='whole_zone'/>
  <annotation_step engine='carafe_tokenize_engine'
                   sets_added='category:token' type='auto'
                   name='carafe_tokenize'/>
  <annotation_step type='hand' name='correct'
                   sets_modified='category:content'/>
</steps>
The most common step is the annotation step, whose type attribute
determines its subtype. This task has three auto steps (whole_zone,
align, carafe_tokenize), one mixed step (carafe_tag), and one hand
step (correct). Each step specifies which annotation sets it adds or
modifies, and optionally connects those annotation sets or categories
with the engine which applies them.
These steps can be assembled into workflows:
<workflows>
  <workflow name='Tokenless hand annotation'>
    <step pretty_name='zone' name='whole_zone'/>
    <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
  </workflow>
  <workflow name='Review/repair'>
    <step name='correct'/>
  </workflow>
  <workflow name='Demo' undoable="yes">
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step pretty_name='tag' name='carafe_tag'/>
  </workflow>
  <workflow name='Align'>
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step name='align'/>
  </workflow>
</workflows>
Here's what these workflows do: "Tokenless hand annotation" zones the
document and then supports hand tagging, without tokenization;
"Review/repair" supports hand correction of existing content
annotations; "Demo" zones and tokenizes the document and then tags it
automatically, and is marked undoable; and "Align" zones and tokenizes
the document and then runs the align engine.
Once the workflows are defined, you can (optionally) specify the
properties of workspaces. By default, you don't need to say
anything additional about workspaces, since every
human-annotatable workflow can serve as the basis of a workspace,
but you might want to declare your default configuration (or
workflow), and set special properties of workspace operations:
<workspaces default_config="Demo">
  <workspace workflow='Demo'/>
</workspaces>
Here, we define a default configuration, and set up a block which
can (but currently does not) customize the behavior of the "Demo"
workflow in the context of the workspaces.
At this point, we end the first task and begin the second one.
</task>
<task name='Enhanced Named Entity'>
  <languages>
    <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
  </languages>
And now, we define the annotations and their displays in the
second task:
<annotations all_annotations_known='no'
             inherit='category:zone,category:token'>
  <span label='PERSON'
        d_css='background-color: #CCFF66' d_accelerator='P'
        d_edit_immediately='yes'>
    <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
  </span>
  <span label='LOCATION'
        d_css='background-color: #FF99CC' d_accelerator='L'
        d_edit_immediately='yes'>
    <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
    <boolean name='is_political_entity'/>
  </span>
  <span label='ORGANIZATION'
        d_css='background-color: #99CCFF' d_accelerator='O'
        d_edit_immediately='yes'>
    <string name='nomtype' choices="Proper name,Noun,Pronoun"/>
  </span>
  <spanless label='PERSON_COREF'
            d_css='background-color: lightgreen' d_accelerator='C'>
    <filler_set name='mentions' filler_types='PERSON'/>
  </spanless>
  <span label='LOCATED_EVENT'
        d_css='background-color: pink' d_accelerator='E'
        d_edit_immediately='yes'>
    <filler name='actor' filler_types='PERSON'/>
    <filler name='location' filler_types='LOCATION,ORGANIZATION'/>
  </span>
  <spanless label='LOCATION_RELATION'
            d_css='background-color: orange' d_accelerator='R'>
    <filler name='located' filler_types='ORGANIZATION,PERSON'/>
    <filler name='location' filler_types='LOCATION'/>
  </spanless>
</annotations>
This annotation definition block is much more complex than the one in the "Named Entity" task. In addition to the three labels we saw previously, we also have three other labels: "LOCATED_EVENT" (spanned) and "PERSON_COREF" and "LOCATION_RELATION" (spanless). We also have several attributes, of different types. Most notable is the "mentions" attribute of the "PERSON_COREF" annotation, which takes sets of annotations as its value. The annotation display information is also somewhat more complex; we see here that all of the annotations are marked to be edited immediately upon creation.
For reference, here are the same annotation declarations and their
display attributes defined by the legacy method:
<annotation_set_descriptors all_annotations_known='no'
inherit='category:zone,category:token'>
<annotation_set_descriptor category='content' name='content'>
<annotation label='PERSON'/>
<annotation label='LOCATION'/>
<annotation label='ORGANIZATION'/>
<attribute name='nomtype' of_annotation='PERSON,LOCATION,ORGANIZATION'>
<choice>Proper name</choice>
<choice>Noun</choice>
<choice>Pronoun</choice>
</attribute>
<attribute name='is_political_entity' type='boolean'
of_annotation='LOCATION'/>
<annotation label='LOCATED_EVENT'/>
<attribute name='actor' type='annotation' of_annotation='LOCATED_EVENT'>
<label_restriction label='PERSON'/>
</attribute>
<attribute name='location' type='annotation'
of_annotation='LOCATED_EVENT'>
<label_restriction label='LOCATION'/>
<label_restriction label='ORGANIZATION'/>
</attribute>
<annotation span='no' label='PERSON_COREF'/>
<attribute name='mentions' aggregation='set' type='annotation'
of_annotation='PERSON_COREF'>
<label_restriction label='PERSON'/>
</attribute>
<annotation span='no' label='LOCATION_RELATION'/>
<attribute name='located' type='annotation'
of_annotation='LOCATION_RELATION'>
<label_restriction label='ORGANIZATION'/>
<label_restriction label='PERSON'/>
</attribute>
<attribute name='location' type='annotation'
of_annotation='LOCATION_RELATION'>
<label_restriction label='LOCATION'/>
</attribute>
</annotation_set_descriptor>
</annotation_set_descriptors>
<annotation_display>
<label css='background-color: #CCFF66' name='PERSON' accelerator='P'
edit_immediately='yes'/>
<label css='background-color: #FF99CC' name='LOCATION' accelerator='L'
edit_immediately='yes'/>
<label css='background-color: #99CCFF' name='ORGANIZATION' accelerator='O'
edit_immediately='yes'/>
<label css='background-color: lightgreen' name='PERSON_COREF'
accelerator='C' edit_immediately='yes'/>
<label css='background-color: pink' name='LOCATED_EVENT' accelerator='E'
edit_immediately='yes'/>
<label css='background-color: orange' name='LOCATION_RELATION'
accelerator='R' edit_immediately='yes'/>
</annotation_display>
The remainder of this task is essentially identical to the "Named
Entity" task:
<engines>
  <engine name='carafe_tag_engine'>
    <default_model>default_enhanced_model</default_model>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
      <build_settings training_method='psa' max_iterations='6'/>
    </model_config>
    <model_config config_name='alt_model_build'
                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='align_engine'>
    <step_config class='MAT.PluginMgr.AlignStep'/>
  </engine>
  <engine name='whole_zone_engine'>
    <step_config class='MAT.PluginMgr.WholeZoneStep'/>
  </engine>
  <engine name='carafe_tokenize_engine'>
    <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  </engine>
</engines>
<steps>
  <annotation_step engine='align_engine' type='auto' name='align'/>
  <annotation_step engine='carafe_tag_engine' sets_added='category:content'
                   type='mixed' name='carafe_tag'/>
  <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                   type='auto' name='whole_zone'/>
  <annotation_step engine='carafe_tokenize_engine'
                   sets_added='category:token' type='auto'
                   name='carafe_tokenize'/>
  <annotation_step type='hand' name='correct'
                   sets_modified='category:content'/>
</steps>
<workflows>
  <workflow name='Tokenless hand annotation'>
    <step pretty_name='zone' name='whole_zone'/>
    <step name='carafe_tag' pretty_name='hand tag' type='hand'/>
  </workflow>
  <workflow name='Review/repair'>
    <step name='correct'/>
  </workflow>
  <workflow name='Demo' undoable="yes">
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step pretty_name='tag' name='carafe_tag'/>
  </workflow>
  <workflow name='Align'>
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step name='align'/>
  </workflow>
</workflows>
<workspaces>
  <workspace workflow='Demo'/>
</workspaces>
Notably, because the jCarafe tagger only operates on the simple span subset of this (or any) task, the "Demo" workflow will only apply the spanned labels, not the attributes associated with them, and won't apply the spanless labels at all.
At this point, we end the second task and begin the third:
</task>
<task name='Sample Relations'>
  <languages>
    <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
  </languages>
This third task is intended to illustrate the impact of multiple
content annotation sets: the ability to reuse tagging engines, to
segregate annotation activities by annotation set, and to support
multiple mixed-initiative steps in the same workflow. As part of
this illustration, we've implemented an extremely simplistic
two-argument trainable relation tagger, which essentially does
classification of the bags of words in between successive pairs of
candidate relations. We're not advertising this as a relation
tagging capability for anything besides demonstrating how
trainable relation tagging might be integrated. Here are the
annotation sets and their displays:
<annotations all_annotations_known='no'
             inherit='category:zone,category:token'>
  <span label='PERSON' of_set='entities'
        d_css='background-color: LawnGreen' d_accelerator='P'
        d_edit_immediately='yes'/>
  <span label='LOCATION' of_set='entities'
        d_css='background-color: HotPink' d_accelerator='L'
        d_edit_immediately='yes'/>
  <span label='ORGANIZATION' of_set='entities'
        d_css='background-color: DeepSkyBlue' d_accelerator='O'
        d_edit_immediately='yes'/>
  <span label='NATIONALITY' of_set='nationality'
        d_css='background-color: PaleVioletRed' d_accelerator='N'
        d_edit_immediately='yes'/>
  <spanless label="Employment" of_set='relations'
            d_css="background-color: Gray">
    <filler name="Employee" filler_types="PERSON"/>
    <filler name="Employer" filler_types="ORGANIZATION,LOCATION,NATIONALITY"/>
  </spanless>
  <spanless label="Located" of_set='relations'
            d_css="background-color: Thistle">
    <filler name="Located-Entity" filler_types="PERSON,ORGANIZATION"/>
    <filler name="Location" filler_types="LOCATION,NATIONALITY"/>
  </spanless>
</annotations>
Notice that there are three annotation sets rather than one, and
while they're each in category "content", they each have a
different set name.
For reference, here are the same annotation declarations and their
display attributes defined by the legacy method:
<annotation_set_descriptors all_annotations_known='no'
inherit='category:zone,category:token'>
<annotation_set_descriptor category='content' name='entities'>
<annotation label='PERSON'/>
<annotation label='LOCATION'/>
<annotation label='ORGANIZATION'/>
</annotation_set_descriptor>
<annotation_set_descriptor category='content' name='nationality'>
<annotation label='NATIONALITY'/>
</annotation_set_descriptor>
<annotation_set_descriptor category='content' name='relations'>
<annotation label="Employment" span="no"/>
<attribute name="Employee" of_annotation="Employment" type="annotation">
<label_restriction label="PERSON"/>
</attribute>
<attribute name="Employer" of_annotation="Employment" type="annotation">
<label_restriction label="ORGANIZATION"/>
<label_restriction label='LOCATION'/>
<label_restriction label="NATIONALITY"/>
</attribute>
<annotation label="Located" span="no"/>
<attribute name="Located-Entity" of_annotation="Located" type="annotation">
<label_restriction label="PERSON"/>
<label_restriction label="ORGANIZATION"/>
</attribute>
<attribute name="Location" of_annotation="Located" type="annotation">
<label_restriction label="LOCATION"/>
<label_restriction label="NATIONALITY"/>
</attribute>
</annotation_set_descriptor>
</annotation_set_descriptors>
<annotation_display>
<label css='background-color: LawnGreen' name='PERSON' accelerator='P'
edit_immediately='yes'/>
<label css='background-color: HotPink' name='LOCATION' accelerator='L'
edit_immediately='yes'/>
<label css='background-color: DeepSkyBlue' name='ORGANIZATION' accelerator='O'
edit_immediately='yes'/>
<label css='background-color: PaleVioletRed' name='NATIONALITY' accelerator='N'
edit_immediately='yes'/>
<label name="Employment" css="background-color: Gray" edit_immediately="yes"/>
<label name="Located" css="background-color: Thistle" edit_immediately="yes"/>
</annotation_display>
The next element, which doesn't appear in the other two tasks,
supports a range of Web UI customizations. In this case, we
specify that we want the annotations to appear in the annotation
menu and legend in the order they're defined, not in alphabetical
order:
<web_customization alphabetize_labels="no"/>
Next, we define the engines.
<engines>
  <engine name='carafe_tag_engine'>
    <default_model>default_model</default_model>
    <model_config class='MAT.JavaCarafe.CarafeModelBuilder'>
      <build_settings training_method='psa' max_iterations='6'/>
    </model_config>
    <model_config config_name='alt_model_build'
                  class='MAT.JavaCarafe.CarafeModelBuilder'/>
    <step_config class='MAT.JavaCarafe.CarafeTagStep'/>
  </engine>
  <engine name='trivial_relation_tag_engine'>
    <default_model>default_relation_model</default_model>
    <model_config class='TrivialRelationTagger.CarafeMaxentRelationModelBuilder'/>
    <step_config class='TrivialRelationTagger.CarafeRelationTagStep'/>
  </engine>
  <engine name='align_engine'>
    <step_config class='MAT.PluginMgr.AlignStep'/>
  </engine>
  <engine name='whole_zone_engine'>
    <step_config class='MAT.PluginMgr.WholeZoneStep'/>
  </engine>
  <engine name='carafe_tokenize_engine'>
    <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  </engine>
</engines>
In addition to the engines declared in the other two tasks, this
task includes the "trivial_relation_tag_engine", which implements
the simplistic relation tagging we described a moment ago.
The important differences arise in the definition of the steps:
<steps>
  <annotation_step engine='align_engine' type='auto' name='align'/>
  <annotation_step engine='carafe_tag_engine' sets_added='entities'
                   type='mixed' name='entity_tag'/>
  <annotation_step engine='carafe_tag_engine' sets_added='nationality'
                   type='mixed' name='nationality_tag'/>
  <annotation_step engine='carafe_tag_engine' sets_added='entities,nationality'
                   type='mixed' name='all_entity_tag'/>
  <annotation_step engine='trivial_relation_tag_engine' sets_added='relations'
                   type='mixed' name='relation_tag'/>
  <annotation_step engine='whole_zone_engine' sets_added='category:zone'
                   type='auto' name='whole_zone'/>
  <annotation_step engine='carafe_tokenize_engine'
                   sets_added='category:token' type='auto'
                   name='carafe_tokenize'/>
  <annotation_step type='hand' name='correct'
                   sets_modified='category:content'/>
</steps>
Here, we see that the "carafe_tag_engine" is used in three
different steps: "entity_tag", "nationality_tag", and
"all_entity_tag". This last step adds two annotation sets, rather
than one. When a trainable engine defines a default model, the
model path is suffixed with both the language and the step name
when it's referenced and used, so these three cases will be kept
separate. We also see that the "trivial_relation_tag_engine" is
used in the "relation_tag" step. All four of these steps are mixed
steps, so you'll be able to annotate the sets they add by hand de
novo, or pretag and correct if a model is available.
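The between-entities featurization described earlier (bags of words between successive pairs of candidate arguments) might be sketched as follows. All names and spans here are hypothetical; the actual implementation lives in the ne task's python subdirectory:

```python
def between_bags(text, entities):
    """For each successive pair of candidate argument spans (sorted by
    position), collect the bag of words in the intervening text. A
    classifier over these bags can then propose a relation label, or
    none, for the pair. Purely illustrative."""
    ents = sorted(entities)
    bags = []
    for (s1, e1, lab1), (s2, e2, lab2) in zip(ents, ents[1:]):
        bags.append(((lab1, lab2), set(text[e1:s2].split())))
    return bags

text = "John Smith works for Acme Corp in Boston"
entities = [(0, 10, "PERSON"), (21, 30, "ORGANIZATION"), (34, 40, "LOCATION")]
bags = between_bags(text, entities)
```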
Finally, we have the workflows and workspaces (and then we end the
task and the file):
<workflows>
  <workflow name='Mixed Initiative Annotation' undoable="yes">
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step name='entity_tag' pretty_name='tag entities'/>
    <step name='nationality_tag' pretty_name='tag nationalities'/>
    <step name='relation_tag' pretty_name='tag relations'/>
  </workflow>
  <workflow name='Review/repair all steps'>
    <step name='correct'/>
  </workflow>
  <workflow name='Demo'>
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step pretty_name='tag all entities' name='all_entity_tag'/>
    <step pretty_name='tag relations' name='relation_tag'/>
  </workflow>
  <workflow name='Align'>
    <step pretty_name='zone' name='whole_zone'/>
    <step pretty_name='tokenize' name='carafe_tokenize'/>
    <step name='align'/>
  </workflow>
</workflows>
<workspaces default_config="Mixed Initiative Annotation">
  <workspace workflow='Mixed Initiative Annotation'/>
</workspaces>
</task>
</tasks>
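Since task.xml is ordinary XML, the assembled structure is easy to sanity-check with standard tools; for example, with Python's ElementTree (a sketch, not part of MAT, over a stripped-down multi-task file):

```python
import xml.etree.ElementTree as ET

# A skeletal multi-task file: <tasks> wrapping named <task> elements.
doc = """<tasks>
  <task name='Named Entity'><languages><language code='en'/></languages></task>
  <task name='Enhanced Named Entity'><languages><language code='en'/></languages></task>
</tasks>"""

root = ET.fromstring(doc)
# Each task must be named; collect the names for inspection.
task_names = [t.get("name") for t in root.findall("task")]
```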
The "Mixed Initiative Annotation" workflow illustrates how
multi-step mixed initiative works: it contains several mixed
steps, each responsible for a different content annotation set.
The annotator will complete the entity annotations, and then the
nationality annotations, and then the relation annotations. In
each case, the annotator will have the option of pretagging; most
significantly, the annotator will have the opportunity to correct
the spans before pretagging the relations.
The task.xml file in the classification directory contains only the
"Sample Sentiment" task, so its top-level element is <task>. This
task illustrates sentence classification.
We name the task and declare the language information:
<task name='Sample Sentiment'>
  <languages>
    <language code='en' name='English' tokenless_autotag_delimiters='.,/?!;:'/>
  </languages>
Next, we declare the annotations. We don't bother with zone
annotations for this task, since the whole document is annotatable
and unzoned documents are treated as such. So we only inherit category:token.
<annotations all_annotations_known='no' inherit='category:token'>
  <annotation_set name="structure" managed="no"/>
  <span label="sentence" of_set="structure" d_rendering_style="background_span"
        d_css="background-color: lightgray">
    <string name="sentiment" of_set="sentiment">
      <choice value="positive" d_accelerator="P" d_css="background-color: green"/>
      <choice value="negative" d_accelerator="N" d_css="background-color: red; color: white"/>
    </string>
  </span>
</annotations>
There are a number of important features in this annotation
declaration block.
First, the <annotation_set> element declares a new
annotation set named structure, which is marked as managed="no".
Most annotation sets are managed, the significance of which is
discussed here. However, for
the structure annotations, there isn't going to be anything to
manage, because they won't be hand-annotated or corrected. In most
circumstances, it wouldn't matter that there's no advantage to
managing this set; but in this case, the same engine which generates
the tokens (the jCarafe tokenizer) also generates the sentences, the
token annotations are unmanaged, and no step (and thus, no engine)
can simultaneously add managed and unmanaged sets.
Next, we define a sentence span annotation, and assign it to the
structure set. We've declared the rendering style for
this annotation to be background_span, which forces the
annotation to be styled behind the text, rather than stacked on
top of it. This would be important, from the point of view of
presentation, if this task were also adding other span annotations
(it isn't, but it's useful to illustrate the functionality here).
Next, we define an attribute of the sentence span, named sentiment.
We put this attribute in a separate annotation set, also named sentiment.
The reason we separate it is that we're going to use the
maximum-entropy trainer/tagger to learn the value of this
attribute, and the annotation sets which are added or learned by
this engine must consist exclusively of attributes.
Next, we define two choices for this attribute: positive
and negative. Note that we do not define a default value
for the attribute, so the sentence classifier will add a "null"
label during classification, and leave all "null"-labeled elements
unmarked. In other words, the distinction modeled here is actually
a three-way distinction.
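Concretely, because no default value is declared, a tally over the classifier's output has three buckets, not two (the predictions here are hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical predicted sentiment values for ten sentences; None stands
# in for the "null" label assigned when neither declared choice applies.
predictions = ["positive", None, "negative", "positive", None,
               None, "negative", "positive", None, None]

# The effective label set is three-way: positive, negative, and null.
counts = Counter(p if p is not None else "null" for p in predictions)
```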
Finally, we assign styling to these two choices, so we can
distinguish visually between positive, negative, and unmarked
sentiment sentences.
This last feature (choice styling) is not available in the legacy
annotation declaration format, and so we do not present the legacy
version of these declarations.
Next, we declare the engines:
<engines>
  <engine name='carafe_tokenize_engine'>
    <step_config class='MAT.JavaCarafe.CarafeTokenizationStep'/>
  </engine>
  <engine name='classifier_engine'>
    <model_config class='MAT.JavaCarafe.JCarafeMaxentClassifierModelBuilder'>
      <build_settings feature_extractors="_bagOfWords,_bigrams"/>
    </model_config>
    <step_config class='MAT.JavaCarafe.JCarafeMaxentClassifierTagStep'/>
  </engine>
</engines>
These declarations consist of the standard jCarafe
tokenizer/sentence segmenter, and the jCarafe maximum-entropy
classifier, which uses bags of words and bigrams for each sentence
as its classification features.
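The two feature extractors named by the `_bagOfWords,_bigrams` setting can be pictured as follows; this is a sketch of the general technique, not jCarafe's internal implementation:

```python
def bag_of_words(tokens):
    # Each distinct token in the sentence becomes a binary feature.
    return set(tokens)

def bigrams(tokens):
    # Each adjacent pair of tokens becomes a binary feature.
    return set(zip(tokens, tokens[1:]))

tokens = "the service was not good".split()
features = bag_of_words(tokens) | bigrams(tokens)
```

The bigram ("not", "good") is exactly the kind of evidence for negative sentiment that a pure bag of words misses.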
Next, we declare the steps:
<steps>
  <annotation_step engine="carafe_tokenize_engine" sets_added="category:token,structure"
                   type="auto" name="carafe_structure"/>
  <annotation_step engine='classifier_engine'
                   sets_added='sentiment' type='mixed'
                   name='attribute_tag'/>
</steps>
We declare only two steps. The first uses the tokenizer engine
and adds the token annotations, and the annotations from the structure
set (i.e., the sentences). The second step's engine trains for and
applies the attributes in the sentiment set. We enable
mixed-initiative annotation (i.e., pretagging plus correction) for
this second step. The single workflow in this task can be undone,
and uses these two steps:
<workflows>
  <workflow name="Demo" undoable="yes">
    <step name="carafe_structure">
      <run_settings sentence_label="sentence"/>
    </step>
    <step name='attribute_tag'/>
  </workflow>
</workflows>
Note that in order to retrieve the sentences, we have to pass a
runtime setting to the step which specifies the label name of the
desired sentence annotations.
Finally, we declare a score profile, and end:
<score_profile>
  <attrs_alone true_labels="sentence"/>
</score_profile>
</task>
The score profile is not so important in this task, but it's
crucial in more complex tasks. The <attrs_alone> element
tells the scorer to break out separate scores for the performance
of the individual attributes, in addition to evaluating the
overall correctness of the sentence annotation itself. The span and
label of the sentence annotation here are never in doubt; they will
always be the same for any document the task is run on, because the
sentence segmenter is not trained.
annotation, then, comes down to the value of the sentiment
attribute. So for the purposes of this task, there's no need to
break out the separate attribute score. However, if there were
more than one attribute on the sentence that we were training for,
we would want to break out the separate contribution of each
attribute so we could evaluate them independently without having
to do separate runs for each attribute.
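The distinction can be made concrete with a toy tally (hypothetical data, not MAT's scorer; the second attribute, "topic", is invented to illustrate the multi-attribute case): overall correctness requires every attribute on an annotation to match, while attrs_alone-style scores check one attribute at a time.

```python
# Reference and hypothesis attribute values for three sentence annotations.
ref = [{"sentiment": "positive", "topic": "food"},
       {"sentiment": "negative", "topic": "service"},
       {"sentiment": None,       "topic": "food"}]
hyp = [{"sentiment": "positive", "topic": "service"},
       {"sentiment": "negative", "topic": "service"},
       {"sentiment": "positive", "topic": "food"}]

# Overall correctness: the whole annotation must match.
overall = sum(r == h for r, h in zip(ref, hyp)) / len(ref)

# Broken out per attribute, as <attrs_alone> requests.
per_attr = {a: sum(r[a] == h[a] for r, h in zip(ref, hyp)) / len(ref)
            for a in ("sentiment", "topic")}
```

Here both attributes score 2/3 individually, yet only one annotation of three is fully correct; separate runs per attribute are unnecessary because the breakdown comes from a single scoring pass.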