Now that we've covered file
mode in the first five tutorials, we're going to address
workspace mode. In
workspace mode, you don't have nearly as much control over
On the other hand, you don't need to worry about any of those
things, either.
We're going to use the same simple 'Named Entity' task, and we're going to assume that your task is installed. This tutorial involves both the UI and the command line, so make sure you're familiar with the "Conventions" section in your platform-specific instructions in the "Getting Started" section of the documentation.
The only way to create a workspace is on the command line. We use
MATWorkspaceEngine. The
first argument of MATWorkspaceEngine is the path of the affected
workspace, and the second argument is the operation. Options and
arguments for the chosen operation follow.
Creating a workspace requires a task, so we provide the --task
directive. Workspaces also track annotation progress by user, so
we need at least one user name to create the workspace.
Every workflow that has at least one hand-annotatable step can be made into a workspace. Your task may have a default workflow for your workspace; in the case of the 'Named Entity' task, the default workflow is the same "Demo" workflow we've been working with up to now. If we want to use the default workflow, we don't need to specify it.
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace create \
--task 'Named Entity' --initial_users user1
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace create ^
--task "Named Entity" --initial_users user1
Created workspace for task 'Named Entity' in directory ...
You now have a workspace in the specified directory, built on top
of the "Demo" workflow of the "Named Entity" task.
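If you want to script workspace creation, the argument order described above (workspace path, then operation, then options) can be captured in a small helper. The following is a minimal Python sketch, not part of MAT: the names engine_path and workspace_command are hypothetical, the install location "/opt/mat" is an assumption, and the code only assembles the argument list from the flags shown in this tutorial without running anything.

```python
import sys


def engine_path(mat_pkg_home, windows=None):
    """Return the path of MATWorkspaceEngine under MAT_PKG_HOME.

    Hypothetical helper: it mirrors the Unix and Windows commands in
    this tutorial (the Windows script carries a .cmd extension).
    """
    if windows is None:
        windows = sys.platform.startswith("win")
    name = "MATWorkspaceEngine.cmd" if windows else "MATWorkspaceEngine"
    sep = "\\" if windows else "/"
    return sep.join([mat_pkg_home.rstrip("/\\"), "bin", name])


def workspace_command(mat_pkg_home, workspace, operation, *args, windows=None):
    """Build the argument list: engine, workspace path, operation, the rest."""
    return [engine_path(mat_pkg_home, windows), workspace, operation, *args]


create_cmd = workspace_command(
    "/opt/mat",  # assumed install location; substitute your MAT_PKG_HOME
    "/tmp/ne_workspace",
    "create",
    "--task", "Named Entity",
    "--initial_users", "user1",
    windows=False,
)
print(create_cmd)
```

You could pass the resulting list to subprocess.run(create_cmd, check=True). Because the task name is a single list element, it needs no shell quoting.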
If you're interested in the structure of a workspace, look here.
Workspaces organize files by folders, and they track the status of the files as they're processed. The "core" folder supports all the normal annotation functions. We'll begin by importing a single raw file into the core folder.
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample/ne/resources/data/raw/voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" ^
--file_type raw "core" %CD%\sample\ne\resources\data\raw\voa2.txt
So here we use the "import" operation, which takes two arguments:
the folder name ("core") and the file to import. We've also used
the --strip_suffix directive to modify the name by which the
workspace knows the file. Finally, we've told the workspace
engine, via the --file_type option, that the file we're importing
is a raw file (rather than a rich MAT
JSON file). For more details on importing documents, see here.
We can see the contents of the workspace (and of each folder),
with the "list" operation:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list "core"
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list "core"
core:
voa2 (status unannotated, openable True, current step carafe_tag, workflow status awaiting hand annotation)
Note that the listing tells you the status
of the document.
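If you want to inspect the listing from a script, each document line can be split into its fields. The following is a minimal Python sketch under stated assumptions: the function parse_listing_line is hypothetical, and the field names it recognizes are only the ones that appear in the listings in this tutorial; the workspace engine may print others.

```python
import re

# Field names seen in this tutorial's listings. This set is an
# assumption; the real engine may emit fields not covered here.
KNOWN_FIELDS = ("workflow status", "current step", "locked by", "status", "openable")

# A document line looks like: name (field value, field value, ...)
LINE_RE = re.compile(r"^\s*(?P<name>\S+)\s+\((?P<fields>.*)\)\s*$")


def parse_listing_line(line):
    """Parse one document line of 'list' output into a dict.

    Returns None for lines that aren't document lines, such as the
    folder headers ("core:").
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    info = {"name": match.group("name")}
    for chunk in match.group("fields").split(","):
        chunk = chunk.strip()
        for field in KNOWN_FIELDS:
            if chunk.startswith(field + " "):
                # Everything after the field name is the value.
                info[field] = chunk[len(field) + 1:]
                break
    return info


sample = ("voa2 (status unannotated, openable True, current step carafe_tag, "
          "workflow status awaiting hand annotation)")
doc = parse_listing_line(sample)
print(doc["name"], doc["status"])
```

Values are kept as strings, so "openable" comes back as the text "True" rather than a boolean.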
You can only import a file name once. If you try to import the
file again, you'll get an error:
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample/ne/resources/data/raw/voa2.txt
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" ^
--file_type raw "core" sample\ne\resources\data\raw\voa2.txt
Error: Basenames for files sample/ne/resources/data/raw/voa2.txt already exist in workspace; no files imported.
In other words, once you create a particular basename in the
workspace using the "import" operation, you can't do it again.
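If you import files from a script, you can check for this situation before calling the engine. The following is a minimal Python sketch, assuming the workspace basename is simply the file name with the --strip_suffix value removed from the end (as with voa2.txt becoming voa2 above). The function names are hypothetical, and the engine's own check remains the authority.

```python
import os


def workspace_basename(path, strip_suffix=""):
    """Name the workspace is assumed to use for a file.

    Assumption: the basename is the file name with strip_suffix
    removed from its end, if present.
    """
    name = os.path.basename(path)
    if strip_suffix and name.endswith(strip_suffix):
        name = name[: -len(strip_suffix)]
    return name


def already_imported(paths, existing_basenames, strip_suffix=""):
    """Return the paths whose basenames are already in the workspace."""
    existing = set(existing_basenames)
    return [p for p in paths if workspace_basename(p, strip_suffix) in existing]


conflicts = already_imported(
    ["sample/ne/resources/data/raw/voa1.txt",
     "sample/ne/resources/data/raw/voa2.txt"],
    existing_basenames=["voa2"],
    strip_suffix=".txt",
)
print(conflicts)
```

The existing basenames would come from the "list" operation; here they are supplied by hand.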
In this step, we're going to learn about the UI aspects of the
workspace.
First, start up the UI as we described in Tutorial 1.
Note: when you start up the Web server in its default mode, workspaces will only be accessible from a browser client running on the same host. There are many options available to the Web server at startup which affect the workspaces, so if you want to use workspaces in the UI, we recommend that you familiarize yourself with the MATWeb documentation.
In the terminal in which you're running the Web server, you'll
see this when it starts up:
Web server started on port 7801.
Web server command loop. Commands are:
exit - exit the command loop and stop the Web server
loopexit - exit the command loop, but leave the Web server running
taggerexit - shut down the tagger service, if it's running
restart - restart the Web server
ws_key - show the workspace key
help, ? - this message
Workspace key is XJ9dGBaCNveYHk9CZzw6wTM5WH8x05y1
Command:
Note the workspace key. This key is randomly generated and known
only to the user who starts the Web server, and it must be
provided to the UI when the user opens the workspace. This simple
security feature ensures that the Web server modifies the
workspace only if the UI user has proved that they have the
appropriate access. For more about workspace security
and the UI, see here.
Next:
You should see a window that looks like this:
A single left click on the file name in the workspace tab should
open the file. You'll see that this document has been prepared for
annotation (it has been zoned and tokenized, in particular).
You'll see in the controls on the right that its status, as shown
in the listing above, is "unannotated", which means that no human
annotator has touched it yet, and you'll see the current step
marked, since your workflow may have multiple steps in which you
can perform hand annotation:
Note how the controls area here differs from the one in file
mode:
If you select the folder tab now, you'll see that the document is
now listed as "locked by user1". Workspaces maintain document
locks to ensure that no one else overwrites your changes. This
lock will be freed when you close the document.
At this point, you can annotate your document as you did in Tutorial 1. If you want to leave the
workspace without finishing your annotation, just select the Save
operation in the operations menu and press Go; you can always
return to the document. Once you're satisfied with your
annotations, select "Mark gold" in the operations menu and press
Go; your document will be saved and the document status updated.
Finally, close the document. In a minute, we're going to do some
automated tagging in the workspace, and currently this is not
possible while documents are locked.
You'd typically annotate several documents in the first round
before building a model, but we want to move directly to that
step. Since we only have one hand-annotated document at the
moment, what we're going to do is import some other documents into
the workspace. We're going to import some of the annotated
documents that come with the Named Entity task into the core
folder; these documents are already marked internally as
gold-standard reconciled documents (i.e., in addition to being
marked gold, their correctness has been validated by further
review). We're also going to import one of them as a raw document.
Unix:
% cd $MAT_PKG_HOME
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt" \
--file_type raw "core" sample/ne/resources/data/raw/voa1.txt
% bin/MATWorkspaceEngine /tmp/ne_workspace import --strip_suffix ".txt.json" \
"core" sample/ne/resources/data/json/voa[3-9].txt.json
Windows native:
> cd %MAT_PKG_HOME%
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt" ^
--file_type raw "core" sample\ne\resources\data\raw\voa1.txt
> bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace import --strip_suffix ".txt.json" ^
"core" sample\ne\resources\data\json\voa3.txt.json ^
sample\ne\resources\data\json\voa4.txt.json ^
sample\ne\resources\data\json\voa5.txt.json ^
sample\ne\resources\data\json\voa6.txt.json ^
sample\ne\resources\data\json\voa7.txt.json ^
sample\ne\resources\data\json\voa8.txt.json ^
sample\ne\resources\data\json\voa9.txt.json
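The Unix command relies on the shell to expand voa[3-9].txt.json, while the Windows command lists each file by hand. If you'd rather build the file list the same way on both platforms, Python's glob module understands the same pattern. The following is a minimal sketch: the function names are hypothetical, and it only assembles the arguments for the "import" operation in the order used in this tutorial.

```python
import glob
import os


def reconciled_sample_docs(mat_pkg_home):
    """Return the sample JSON documents voa3 through voa9, sorted."""
    directory = os.path.join(mat_pkg_home, "sample", "ne", "resources", "data", "json")
    # Escape the directory so only the file name part is treated as a pattern.
    pattern = os.path.join(glob.escape(directory), "voa[3-9].txt.json")
    return sorted(glob.glob(pattern))


def import_args(folder, files, strip_suffix, file_type=None):
    """Arguments that follow the workspace path for the import operation."""
    args = ["import", "--strip_suffix", strip_suffix]
    if file_type is not None:
        args += ["--file_type", file_type]
    return args + [folder] + list(files)
```

For example, import_args("core", reconciled_sample_docs(home), ".txt.json") yields the operation name, the option, the folder and then one argument per matching file, where home is your MAT_PKG_HOME.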
Now, let's list the workspace to see what we have:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list
core:
voa1 (status unannotated, openable True, current step tag, workflow status awaiting hand annotation)
voa2 (status gold, locked by user1, workflow status done, current step tag, openable False)
voa3 (status reconciled, openable True, current step tag, workflow status done)
voa4 (status reconciled, openable True, current step tag, workflow status done)
voa5 (status reconciled, openable True, current step tag, workflow status done)
voa6 (status reconciled, openable True, current step tag, workflow status done)
voa7 (status reconciled, openable True, current step tag, workflow status done)
voa8 (status reconciled, openable True, current step tag, workflow status done)
voa9 (status reconciled, openable True, current step tag, workflow status done)
review:
export:
reconciliation:
You can see that the document you tagged is marked gold, and the
documents you just imported are marked reconciled. And finally,
you can see that there is one document - the raw document you just
imported - which is marked unannotated.
Now, we build a model. Workspace models are completely distinct from
default task models, like the one we built in Tutorial 2. They're built exclusively
from the documents in the workspace.
This is a command line operation only. We're going to ask the
workspace to autotag afterwards, which should mark "voa1" as
uncorrected (since now it's been automatically annotated). Each
time we build a model and autotag, any documents that are either
unannotated or uncorrected are autotagged.
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace modelbuild \
--do_autotag "core"
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace modelbuild ^
--do_autotag "core"
Once this is done, we can look at the contents of the workspace
again:
Unix:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine /tmp/ne_workspace list
Windows native:
> %MAT_PKG_HOME%\bin\MATWorkspaceEngine.cmd %TMP%\ne_workspace list
core:
voa1 (status uncorrected, openable True, current step tag, workflow status awaiting hand annotation)
voa2 (status gold, locked by user1, workflow status done, current step tag, openable False)
voa3 (status reconciled, openable True, current step tag, workflow status done)
voa4 (status reconciled, openable True, current step tag, workflow status done)
voa5 (status reconciled, openable True, current step tag, workflow status done)
voa6 (status reconciled, openable True, current step tag, workflow status done)
voa7 (status reconciled, openable True, current step tag, workflow status done)
voa8 (status reconciled, openable True, current step tag, workflow status done)
voa9 (status reconciled, openable True, current step tag, workflow status done)
review:
export:
reconciliation:
Note that voa1, which was previously unannotated, is now
uncorrected - i.e., it's been autotagged but not hand-corrected.
The other documents, because they're gold or reconciled, were used
to create the model which the workspace applied to voa1.
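The effect of building a model and autotagging can be summarized as a rule over document statuses. The following is a minimal Python sketch of the behavior described in this step, not the engine's implementation: the function names are hypothetical, and only the four statuses that appear in this tutorial are considered.

```python
# Statuses as described in this tutorial; the engine may know others.
TRAINING_STATUSES = {"gold", "reconciled"}
AUTOTAG_STATUSES = {"unannotated", "uncorrected"}


def plan_modelbuild(statuses):
    """Split documents into model training input and autotag targets.

    statuses maps a document basename to its status string.
    """
    training = sorted(n for n, s in statuses.items() if s in TRAINING_STATUSES)
    autotag = sorted(n for n, s in statuses.items() if s in AUTOTAG_STATUSES)
    return training, autotag


def after_autotag(statuses):
    """Statuses after autotagging: targets become 'uncorrected'."""
    return {
        name: ("uncorrected" if status in AUTOTAG_STATUSES else status)
        for name, status in statuses.items()
    }


before = {"voa1": "unannotated", "voa2": "gold"}
before.update({"voa%d" % i: "reconciled" for i in range(3, 10)})
training, autotag = plan_modelbuild(before)
print(training, autotag)
```

With the statuses from the first listing, the only autotag target is voa1, and the other eight documents supply the training data.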
Now, you'll want to hand-correct the autotagged document.
If your UI has been open while you've performed the last two
steps on the command line, the UI won't know that the state of the
workspace has changed. You can select the workspace tab and press
the "Refresh" button in the controls area. Now, the state of the
UI and the state of the workspace will be synchronized.
Select the core folder from the folder menu. You should see
"voa1", among other documents. Open it. Review the annotations and
correct whatever is needed. When the document is correct, choose
"Mark gold" and press Go, and the document will be marked gold.
In the next tutorial, we'll learn about the experiment engine. If
you want to learn how to use the experiment engine with
workspaces, don't remove your workspace.
If you're not planning on doing any other tutorials, remove the
workspace:
Unix:
% rm -rf /tmp/ne_workspace
Windows native:
> rd /s /q %TMP%\ne_workspace
If you don't want the "Named Entity" task hanging around, remove
it as shown in the final step of Tutorial 1.