Workspaces provide a guided, structured way of managing and
processing your documents. Make
sure that this is what you want. Workspace mode is provided
by MATWorkspaceEngine on
the command line, and via "File -> Open workspace..." in the
Web UI. You can find a summary of the highlights about using
workspaces here; this document
provides the details.
Workspaces are just directories. The structure of these
directories looks like this:
When you invoke MATEngine, the file
mode workflow processing tool, you have to provide a good deal of
information, and you have to know a good deal about your document:
Workspaces track these things for you, and much more. When you create your workspace, you specify a task,
language, and workflow, and from that point on, the workspace will
manage the progress of each document through the workflow for you.
When you create a workspace, you can specify the workflow using
the --workspace_config option. The value of this option is either
a workflow name, or the name of a workspace configuration based on
that workflow which customizes some of the operations used in your
workspace (see the import, modelbuild, advance, and
apply_crossvalidation operations).
You can define workspace configurations in your task.xml file.
The goal of the workspace is to try to ensure that each document
the workspace is processing is always ready for hand annotation.
So when you import documents into a workspace, they'll be advanced
to the first hand-annotatable step in the workspace's workflow;
when you declare that you're done with a step, the document will
be advanced to the next hand-annotatable step (if any). All
intervening automated processing, including applying the
appropriate trainable models, is done for you. The workspace keeps
track of the current step the document is in.
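The advancement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MAT's implementation: the `Step` tuple, the step names, and the `advance` function are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    name: str
    hand_annotatable: bool  # hypothetical flag; MAT derives this from the task


def advance(steps: List[Step], start: int) -> Tuple[List[str], Optional[str]]:
    """Apply automated steps from index `start` until a hand-annotatable step.

    Returns the automated steps that were applied and the name of the step
    where the document now waits (None if the workflow is finished).
    """
    applied = []
    for step in steps[start:]:
        if step.hand_annotatable:
            return applied, step.name
        applied.append(step.name)  # automated step: done on the user's behalf
    return applied, None


# Hypothetical workflow: two automated steps, then one hand-annotatable step.
workflow = [Step("zone", False), Step("tokenize", False), Step("tag", True)]

on_import = advance(workflow, 0)   # stops at the first hand-annotatable step
after_done = advance(workflow, 3)  # nothing is left once "tag" is done
```

Here `on_import` reports the automated steps that ran before the document stopped for hand annotation, and `after_done` shows a document reaching the end of the workflow.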
In MAT, all documents in workspaces are now closely tracked for
their annotation state in the current step. In the current step,
documents can be:
The document state tracking in the workspaces includes tracking
who modified the annotations in the documents. As a result, every
document edit in workspaces is linked to a workspace user.
The inventory of users of a workspace is entirely up to its
creators and managers. Every workspace must be created with at
least one initial user. The names of these users are not bound to
any external resource; they're not required to be the same as
login names, for instance. They're merely there to provide a way
of attributing document changes. There's no account management or
passwords; you can "claim" to be any registered user you want to
claim to be when you edit a workspace. We're assuming that you're
using MAT workspaces in a cooperative environment in which this
sort of inappropriate behavior won't arise.
Although there's no requirement that registered user names
correspond to external resources like login names, you may find it
easiest to use login names anyway, so that your workspace
annotators don't have to remember a different name when they open
a workspace.
Documents may be editable by any workspace user, or might be assigned to a particular user. If a document is assigned to someone other than you, you'll be able to view it, but not edit it, in the UI.
Workspace users are assigned roles, which indicate what
they can do within the workspace. By default, all users can
annotate documents in the core folder. Users may also have an
optional 'reviewer' role, which allows them to perform human
reviews of other annotators' work and to reconcile documents.
The available operations are:
topic | operation | availability | configurable in workspace configuration | folder |
---|---|---|---|---|
creation | create | command line | no | (global) |
file management | import | command line | yes | (global) |
file management | remove | command line | no | (global) |
file management | assign | command line | no | (global) |
file management | open_file | UI, command line debug | no | (global) |
file management | markgold | UI, command line debug | no | core |
file management | unmarkgold | UI, command line debug | no | core |
file management | save | UI, command line debug | no | core, review, reconciliation |
inspection | list | UI, command line | no | (global) |
inspection | workspace_configuration | command line | no | (global) |
inspection | dump_database | command line | no | (global) |
logging | enable_logging | command line | no | (global) |
logging | disable_logging | command line | no | (global) |
logging | rerun_log | command line | no | (global) |
users | register_users | command line | no | (global) |
users | list_users | command line | no | (global) |
users | add_roles | command line | no | (global) |
users | remove_roles | command line | no | (global) |
automated tagging | modelbuild | command line | yes | core |
automated tagging | advance | UI, command line | yes | core |
experimentation | list_basename_sets | command line | no | (global) |
experimentation | add_to_basename_set | command line | no | (global) |
experimentation | remove_from_basename_set | command line | no | (global) |
experimentation | run_experiment | command line | no | (global) |
review and reconciliation | schedule_review | command line | no | (global) |
review and reconciliation | unschedule_review | command line | no | (global) |
review and reconciliation | list_review_schedule | command line | no | (global) |
review and reconciliation | apply_crossvalidation | command line | yes | core |
review and reconciliation | remove_from_reconciliation | command line | no | reconciliation |
review and reconciliation | request_review | command line | no | core |
review and reconciliation | complete_human_review | UI, command line | no | review |
administration | force_unlock | command line | no | core, review, reconciliation |
There are also internal operations which are not publicly visible (release_lock, update_ui_log).
We'll review each of these operations in turn.
The create operation creates a workspace. It requires a task and an initial user. If the workspace supports multiple languages, similarity profiles, or workspace configurations, these must be supplied as well.
This operation is available only on
the command line.
The import operation ingests
documents into the workspace. The documents are all converted to
MAT JSON format, and are prepared for annotation. You can
optionally assign documents to users.
This operation is only available on the command line.
Historically, the import operation could target multiple folders, but as of MAT 2.0, only the core folder is eligible for import.
In task.xml, by creating a workspace
configuration, you can customize how documents
are prepared for annotation by default when they're imported. The key/value
pairs here are the same as the ones available to MATEngine and the steps it executes,
with the caveat that workflow and steps are specified for you, and
the documents to be imported, and their file types, are specified
as options to the import operation itself. Say, for instance, that
you wanted to provide additional tokenizer patterns to the
tokenizer provided with jCarafe.
Here's how you'd do it:
<workspaces default_config="Demo">
  ...
  <workspace workflow="Demo" config_name="Custom Demo">
    ...
    <operation name="import">
      <settings tokenizer_patterns="..."/>
    </operation>
    ...
  </workspace>
</workspaces>
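To make the shape of such a configuration concrete, here is a minimal Python sketch that reads the settings for one operation out of a fragment like the one above, using only the standard library. It is an illustration, not how MAT itself loads task.xml: the `operation_settings` helper is hypothetical, the fragment is cut down to the elements shown in this document, and the elided `tokenizer_patterns` value is left as `...`.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, with the "..." placeholder lines dropped.
# The tokenizer_patterns value is still elided, exactly as in the original.
FRAGMENT = """
<workspaces default_config="Demo">
  <workspace workflow="Demo" config_name="Custom Demo">
    <operation name="import">
      <settings tokenizer_patterns="..."/>
    </operation>
  </workspace>
</workspaces>
"""


def operation_settings(xml_text, config_name, operation):
    """Return the <settings> attributes for one operation as a dict."""
    root = ET.fromstring(xml_text)
    for ws in root.iter("workspace"):
        if ws.get("config_name") != config_name:
            continue
        for op in ws.iter("operation"):
            if op.get("name") == operation:
                settings = op.find("settings")
                return dict(settings.attrib) if settings is not None else {}
    return {}  # no customization: the operation keeps its defaults


import_settings = operation_settings(FRAGMENT, "Custom Demo", "import")
advance_settings = operation_settings(FRAGMENT, "Custom Demo", "advance")
```

An operation with no `<operation>` element in the configuration simply yields an empty dictionary, which mirrors the idea that uncustomized operations fall back to their defaults.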
In addition, the import operation can
be augmented using the --workflows and --steps options described
in MATWorkspaceEngine.
Finally, note that any advancement
after documents are marked gold or reconciled on import is
governed by the customizations to the "advance"
operation.
The remove operation removes all
copies of the specified basenames from the workspace. Warning: this operation will
remove all traces of the basenames from the workspace folders and
the database. Do not use it unless you really want them removed.
This operation is only available on the command line.
The assign operation assigns the specified
basenames to the specified users. Each user gets his or her own
copy of the document to annotate. If there are no available documents corresponding to the basename which haven't already been altered by a human, the basename cannot be
assigned.
This operation is only available on
the command line.
The open_file operation opens a workspace file
and returns its contents. It also locks the workspace file in the
workspace database. This lock is typically released when a file is
closed in the UI, using the private release_lock operation. If
this document is "stranded" - if, for instance, a user forgets to
close the document - you can use the force_unlock
operation to fix this.
This operation is available in the MAT UI, or on the command line if
--debug is provided.
The markgold operation marks all of the
"non-gold" segments in a document "human
gold" for the current hand annotatable step, and records the
step as done. Then, by default, it checks for scheduled reviews;
if it finds a scheduled review, it submits the document for
review, and if no reviews are found, it advances the document to
the next hand-annotatable step. This automatic advancement can be
customized through the "advance" operation.
This operation is available in the MAT
UI, or indirectly on the command line via the import operation, or
on the command line
if --debug is provided. When used in the UI, it will trigger a
save operation first if the document has unsaved changes. The UI
also has an option to mark gold without advancement; you should
use this option if you want to request a review.
The unmarkgold operation marks all of the "human gold" or "reconciled" segments in a document "non-gold", and marks the step undone for that document.
The save operation saves the contents of a
workspace file.
This operation is available in the MAT UI, or on the command line if --debug
is provided.
The list operation shows you the contents
of the folders in the workspace. The listing shows you the status
of the document, as well as who it's assigned
to.
It is available both on the command line, and in the MAT UI as part of the workspace interface.
The workspace_configuration operation describes a number of properties of the workspace. The properties reported are:
- Task: the name of the task that the workspace uses
- Users: the workspace users that are registered
- Workflow: the workflow and workspace configuration that the workspace relies on
- Language: the language of the workspace
- Logging: whether or not workspace logging is enabled
- Prioritization: in a future release, MAT may support prioritization queues, to enable techniques such as active learning. This capability is currently disabled.
The dump_database operation describes all the
tables in the workspace
database. It is a useful debugging tool for the technically
inclined.
This operation is only available on
the command line.
MAT provides a rich and extensive logging infrastructure
specifically for workspaces. When logging is enabled, MAT
workspace operations log every action and data modification, so
that the activities in the workspace can be rerun from the point
that logging was enabled, exactly as they were originally
performed.
Workspace logging is distinct from UI logging. The
MAT UI can capture all the user gestures and
save them to a CSV file at the user's request. If
workspace logging is enabled, the UI turns on this capability
specifically for that workspace. This workspace-specific UI logging captures the same information as general UI logging, but differs from it in several ways:
general UI logging | workspace UI logging |
---|---|
Enabled and disabled in the UI | Enabled and disabled on the command line along with general workspace logging; no UI controls available |
Captures UI activity for all windows | Captures UI activity for windows associated with the logging-enabled workspace |
Saves the log to a CSV file | Saves the log to a JSON file |
Saves the log in a location determined by the user's browser preferences | Saves the log in the workspace logging directory |
Saves the log when UI logging is disabled | Saves the log when modified workspace files are saved |
For each save, records UI activity since the last time UI logging was enabled | For each save, records UI activity since the last log save, or since the workspace was opened in the UI |
If you also choose to enable general UI logging, you'll get all the expected gestures in your general UI log, including those that are captured for workspace logging.
The enable_logging operation enables logging. The log will be saved in the _checkpoint subdirectory of the workspace directory.
This operation is available on the command line.
The disable_logging operation disables logging. If a log is being collected, by default it is moved to the first available _checkpoint_<n> path. However, the user can instead force the log to be discarded if she chooses. In either case, this ensures that _checkpoint never contains a discontinuous log.
This operation is available on the command line.
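The "first available _checkpoint_<n> path" rule can be pictured with a short Python sketch. It runs against a throwaway directory rather than a real workspace, and it rests on two assumptions that this document does not confirm: that numbering starts at 1, and that the log is archived by a plain directory rename. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def first_available_checkpoint(workspace_dir: Path) -> Path:
    """Return the first _checkpoint_<n> path that does not exist yet.

    Assumption: numbering starts at 1; the real starting index may differ.
    """
    n = 1
    while (workspace_dir / f"_checkpoint_{n}").exists():
        n += 1
    return workspace_dir / f"_checkpoint_{n}"


def archive_log(workspace_dir: Path) -> Path:
    """Move the live _checkpoint directory aside so it never mixes two logs."""
    target = first_available_checkpoint(workspace_dir)
    (workspace_dir / "_checkpoint").rename(target)
    return target


# Demonstration in a temporary directory standing in for a workspace.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "_checkpoint").mkdir()
    (ws / "_checkpoint_1").mkdir()  # an earlier archived log
    archived_name = archive_log(ws).name
    live_log_remains = (ws / "_checkpoint").exists()
```

Because the live directory is always moved away before logging can be enabled again, each archived directory holds exactly one continuous log.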
The rerun_log operation allows you to rerun the log. It will use the _checkpoint/_rerun subdirectory of the workspace directory to store the rerun state. You can use this capability to recreate any intermediate state of your workspace, e.g., for experiment analysis.
This operation is available on the command line.
Workspace users have roles which
say what they can do in the workspace. By default, users have
only one role, "annotator", which means the user is
eligible to perform annotation. An optional "reviewer" role can be
assigned to users, which means they can review or reconcile
documents. The role "all" is a shorthand for both roles.
You can specify user roles explicitly when you register the
users, or afterward. You may want to vary the available roles for
annotators because, e.g., you may want only some of them to
participate in particular reconciliation phases; say, you might
want only some annotators to be able to perform the decisive
human_decision reconciliation step.
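The role rules above are simple enough to sketch in Python. The role names ("annotator", "reviewer", "all") come from this document; the function names and the in-memory representation are hypothetical and are not MAT's API.

```python
# Role names come from the text; everything else here is illustrative.
ALL_ROLES = {"annotator", "reviewer"}
DEFAULT_ROLES = {"annotator"}


def expand_roles(requested=None):
    """Expand a list of role names, treating "all" as both roles."""
    if requested is None:
        return set(DEFAULT_ROLES)  # users are annotators by default
    roles = set()
    for role in requested:
        if role == "all":
            roles |= ALL_ROLES
        elif role in ALL_ROLES:
            roles.add(role)
        else:
            raise ValueError(f"unknown role: {role}")
    return roles


def can_review(roles):
    """Only users holding the reviewer role may review or reconcile."""
    return "reviewer" in roles


default_user = expand_roles()
lead = expand_roles(["all"])
```

A user registered without explicit roles ends up as an annotator only, while a user registered with "all" can also review and reconcile.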
The register_users operation allows you to add
registered users to your
workspace. Perhaps you want to be able to track the contributions
of multiple annotators, or you might want to actually assign documents to multiple annotators and
do multiple annotation. You may also want to assign roles to your
users. You cannot unregister users once they're registered,
although you can remove all their roles.
This operation is only available on
the command line.
The list_users operation lists the users in a workspace. It is also
available as part of the workspace_configuration
operation.
This operation is only available on
the command line.
The add_roles operation adds roles to existing users.
This operation is only available on the command line.
The remove_roles operation removes roles from existing users.
This operation is only available on the command line.
By default, the workspace will attempt to ensure that each file is positioned at an opportunity for user interaction. When a file is imported, the workspace advances the file to the first hand-annotatable step; when the user marks a document gold in a given step, the workspace attempts to advance to the next hand-annotatable step (assuming no reviews are scheduled). If a model exists for a given step, it will be applied to documents in the appropriate circumstances.
The modelbuild operation builds a model which
can be used to automatically tag other documents. Every document
in the workspace which is gold or reconciled for the relevant
annotation set is used to build this model. If there are multiple
copies of a document because the document is multiply assigned,
all copies will be used (so that document will be overrepresented
in the model, and all conflicting annotations will be used as
well). You can optionally ask the workspace to autotag documents
after the model is built.
Note:
the workspace model is completely
distinct from the default task model.
This operation is only available on
the command line.
In task.xml, by creating a workspace
configuration, you can customize your modelbuild operation, e.g.,
restrict it to just the gold segments. You can use any setting
that's available to the training engine.
<workspaces default_config="Demo">
  ...
  <workspace workflow="Demo" config_name="Custom Demo">
    ...
    <operation name="modelbuild">
      <settings partial_training_on_gold_only="yes"/>
    </operation>
    ...
  </workspace>
</workspaces>
By default, documents advance
automatically to the next hand-annotatable step. Several
operations permit you to suppress advancement. If you do, you can
complete the advancement later using the advance operation. This
operation automatically advances the document to the next
hand-annotatable point, or to the end of the workflow if there are
no more hand-annotatable points. You can specify individual
basenames to process, or process all documents.
Note: this operation does
not use the jCarafe tagging server, even in the UI. So the startup
cost of the tagging engine is incurred each time the autotag
operation is executed. This operation also does not use the
default task model, ever; it only uses models constructed using
the modelbuild operation.
This operation is available in the MAT UI (for individual documents) and on
the command line.
When used in the UI, it will trigger a save operation first if the
document has unsaved changes.
In task.xml, by creating a workspace
configuration, you can customize how automated advancement
happens. This customization will apply not just to explicit
invocations of the "advance" operation, but also to every
operation which automatically advances (e.g., markgold,
complete_human_review). The one exception is import, whose
initial processing is governed by customizations to the "import" operation; however, any advancement
after marking gold or reconciled on import is covered by the
customizations here.
The key/value pairs here are the same as the ones available to MATEngine and the steps it executes, with the caveat that workflow, steps, documents and file types are computed for you. Because these customizations apply to all advancements within the workspace's workflow, you should provide all the options you'd want (beyond the initial import processing). If, for instance, your workflow contains a step which accepts the "allow_foo" option, you can specify it here; it will be passed to that step when the step is applied, and ignored otherwise:
<workspaces default_config="Demo">
  ...
  <workspace workflow="Demo" config_name="Custom Demo">
    ...
    <operation name="advance">
      <settings allow_foo="yes"/>
    </operation>
    ...
  </workspace>
</workspaces>
You can use your workspace as a corpus for experiments. You can access this capability via the <workspace_corpora> element for MATExperimentEngine, or you can access it via the workspace engine. You can further subdivide your workspace into basename sets which can be referred to in your experiment.
The list_basename_sets operation lists the basename sets and their contents. This operation is only available on the command line.
The add_to_basename_set operation adds basenames to a given basename set (and implicitly creates the set if necessary). This operation is only available on the command line.
The remove_from_basename_set operation removes basenames from a given basename set (and implicitly removes the set if necessary). This operation is only available on the command line.
The run_experiment operation allows you to run an experiment based on this workspace, either using an experiment file or by specifying the properties of the test set in terms of properties of the workspace basenames. This operation is only available on the command line.
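The implicit creation and removal of basename sets can be illustrated with a small in-memory Python model. This is a sketch of the bookkeeping only, not the workspace engine: the dictionary is a hypothetical stand-in for the workspace's storage, and it assumes that "if necessary" means a set is dropped once its last basename is removed.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the workspace's basename sets.
basename_sets = defaultdict(set)


def add_to_basename_set(name, basenames):
    """Add basenames; the set is created implicitly on first use."""
    basename_sets[name].update(basenames)


def remove_from_basename_set(name, basenames):
    """Remove basenames; drop the set once it is empty (an assumption)."""
    if name not in basename_sets:
        return
    basename_sets[name].difference_update(basenames)
    if not basename_sets[name]:
        del basename_sets[name]


add_to_basename_set("test", ["doc1", "doc2"])
remove_from_basename_set("test", ["doc1"])
after_partial = {k: sorted(v) for k, v in basename_sets.items()}
remove_from_basename_set("test", ["doc2"])
after_full = dict(basename_sets)
```

After the second removal the "test" set no longer exists, so a later listing would not show it at all.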
The force_unlock operation forces a basename in
the named folder to be unlocked. In the reconciliation and review
folders, it will advance to the next hand-annotatable step by
default.
Warning:
be very certain that you apply the force_unlock operation only to basenames whose locks
have been stranded. If you unlock a basename which is being
annotated, the annotator will not be able to save her changes.
This operation is only available on
the command line.
Unlike file mode, workspace mode is stateful from the point of view of the UI. It is
the server, rather than the client, which loads and saves the
files. However, we don't want just anybody to be able to cause the
server to perform these stateful operations, so the MAT web server implements some security mechanisms.
Note, however, that the MAT workspace functionality is not an enterprise-secure implementation, and will never be one. It does not use SSL; it does not perform any sort of user authentication beyond the workspace key; it does not provide any security logging or traceability; and it does not currently implement transactions. You should assume that anyone who has access to your network can see your workspace traffic, and overwrite your data.
Note that workspace users play no
role in workspace security.
Workspaces maintain an internal lock to ensure that any operations which change the state of the workspace are exclusive. This locking mechanism is quite simple - it relies on the presence or absence of the "opLockfile" file. If something goes horribly wrong, it's possible that the workspace may get in a stranded state, where it fails to remove "opLockfile" at the end of the operation. If you're getting a notification that the workspace is in use, and you're sure it's not, you can remove the file by hand. As an added bonus, the file contents will tell you what operation was being performed by which user, and what time the lock was established.
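If you want to look at the lock file before deleting it, a few lines of Python are enough. The sketch below works on a throwaway directory; it assumes that "opLockfile" sits at the top level of the workspace directory and makes no assumption about the format of its contents, which it simply returns as text. The helper names are hypothetical.

```python
import tempfile
from pathlib import Path

LOCK_NAME = "opLockfile"  # file name taken from the text above


def inspect_lock(workspace_dir):
    """Return the lock file's contents, or None if the workspace is unlocked."""
    lock = Path(workspace_dir) / LOCK_NAME
    if not lock.exists():
        return None
    return lock.read_text()


def clear_stranded_lock(workspace_dir):
    """Delete the lock file. Only do this when nobody is using the workspace."""
    lock = Path(workspace_dir) / LOCK_NAME
    if lock.exists():
        lock.unlink()
        return True
    return False


# Demonstration against a temporary directory, not a real workspace.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, LOCK_NAME).write_text("placeholder lock contents")
    before = inspect_lock(tmp)
    removed = clear_stranded_lock(tmp)
    after = inspect_lock(tmp)
```

Reading the file first is worthwhile because, as noted above, its contents identify the operation, the user, and the time the lock was established.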
Workspaces support the option of reviewing documents after
they're annotated. You can schedule a review in advance, for any
document that completes a particular step, or, if there's no
existing schedule, you can request an ad-hoc review after you
complete a step. Finally, you can use a requested review to repair
errors in previous steps.
There are four types of document review:
review type | target folder | availability | relevant operations | how does it work? |
---|---|---|---|---|
human | review | schedule, ad-hoc | schedule_review, unschedule_review, list_review_schedule, request_review, complete_human_review | The document is copied from the core folder to the review folder. An annotator with review privileges (other than the one who last annotated the document in the core folder) reviews the document, and applies the "Save & Done" operation in the UI when satisfied. Once the review document is complete, it is copied back to the core folder and marked reconciled for the step just completed. |
reconciliation | reconciliation | schedule | schedule_review, unschedule_review, list_review_schedule, remove_from_reconciliation | This review is intended to be used when documents are multiply assigned. When this type of review is scheduled for a step and an annotator completes that step, the document is placed in a "suspended" state until all the versions of this document have completed the relevant step. At that point, a reconciliation document is created and inserted into the reconciliation folder, and an annotator with review privileges reconciles the conflicts in the reconciliation document, and applies the "Save & Done" operation, which closes the completely reconciled document. Once the reconciliation document is closed, it is converted back into a normal document and copied back into the core folder, replacing the documents which were submitted for the review. These now-reviewed documents are marked reconciled for the step just completed. At this point, all the copies of the document will be identical. |
reconciliation_with_crossvalidation | reconciliation | schedule, ad-hoc | schedule_review, unschedule_review, list_review_schedule, apply_crossvalidation, remove_from_reconciliation, request_review | This review is like reconciliation review, except that it can be used with single assignment, or no assignment at all; in fact, when it's an ad-hoc request, there must only be one copy of the document in the workspace. When this type of review is triggered by a schedule, or requested ad-hoc, its "suspended" state also involves awaiting cross-validation. Once the user is satisfied that enough documents have accumulated to do cross-validation, she calls the apply_crossvalidation operation, which creates another copy of the document, based on cross-validation-trained models. This additional document copy is added to the reconciliation document, and the review proceeds as above. |
repair | review | ad-hoc | request_review, complete_human_review | This review is a special kind of human review, in which the reviewing user does not require the reviewer role; is the same as the last person who touched the document; and does not mark the document reconciled when it's completed. It's intended for special situations where you've made a mistake in a previous workspace step (which you can't return to). |
The schedule_review operation allows you to schedule a review. This operation is only available on the command line.
The unschedule_review operation allows you to remove a scheduled review. This operation is only available on the command line.
The list_review_schedule operation lists the scheduled reviews, by step. This operation is only available on the command line.
Use the apply_crossvalidation operation to apply crossvalidation to accumulated documents which are waiting for it. In general, you should allow a reasonable number of documents to accumulate awaiting crossvalidation before you trigger it, since otherwise it will essentially do the same thing that autotagging does.
This operation is only available on the command line.
In task.xml, by creating a workspace
configuration, you can customize the crossvalidation defaults.
Here's how you'd do it:
<workspaces default_config="Demo">
  ...
  <workspace workflow="Demo" config_name="Custom Demo">
    ...
    <operation name="apply_crossvalidation">
      <settings folds="..."/>
    </operation>
    ...
  </workspace>
</workspaces>
If, for some reason, a document fails to exit reconciliation naturally (if some of the users fail to complete their reconciliation steps, for example), you can use the remove_from_reconciliation operation to remove the document forcibly from reconciliation. You have the option of discarding the reconciliation decisions that were made. By default, this operation will advance the document to the next hand-annotatable step. This operation is only available on the command line.
If the current step isn't scheduled for review or reconciliation, you can request a review yourself with the request_review operation. Only human review and reconciliation with crossvalidation are available; you can't request a review for a document assigned to someone else.
The 'repair' review type is special; it's equivalent to requesting a human review which you'll conduct yourself, on a document which isn't complete in its current step.
This operation is only available on the command line.
If a document is in the human review folder, you can use the complete_human_review operation to indicate that you're satisfied with the document. If the document isn't being reviewed for repair, this operation will mark the document reconciled for the current step, and then advance the document to the next hand-annotatable step. This operation is only available on the command line, or in the UI via the "Save & Done" operation in the review folder.
The workspace database is an SQLite database which tracks the
status of documents, users, and the workspace itself. The schema
can be found in MAT_PKG_HOME/lib/mat/python/MAT/ws_db.sql. The
tables are:
You may realize, once you've completed an import operation, that
you didn't import the basenames the way you'd wanted; perhaps
you'd intended to strip a suffix, or you assigned them to the
wrong workspace user. You can use the remove operation to remove
the basenames from the workspace in preparation for re-importing.
Warning: this operation
will remove all traces of the basenames from the workspace folders
and the database. Do not use it unless you really want them
removed.
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove basename1...
If you're not sure what basenames are available, the --help
option will list them:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove --help
More on the remove operation here.
The workspaces do not permit documents to be edited by more than
one annotator at a time. The workspaces achieve this exclusivity
through the use of file locks, which are recorded in the workspace
database. When an annotator opens a document for annotation, the
annotation UI is given a lock ID which it can use to release the
document when the editing session is over. In some circumstances,
unfortunately, the document is not unlocked; for instance, if the
UI encounters an unexpected error and crashes before unlocking the
document. You can use the force_unlock operation to clear this
lock from the database.
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> force_unlock --user user1 core basename1
If you just want to unlock everything, don't specify any
basenames. If you want to know what's locked, use the
dump_database operation:
% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> dump_database
This will show you the content of the workspace database tables.
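Since the workspace database is an SQLite database, you can also inspect such a database with Python's standard sqlite3 module. The sketch below is generic: it demonstrates the technique on an in-memory database with a made-up `demo_lock` table, because this document does not give the database file name or its table names. To apply it to a real workspace you would pass the path of the workspace database file to `sqlite3.connect`, and treat it as read-only.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def dump_table(conn, table):
    """Return (column names, rows) for one table."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


# Demonstration on an in-memory database; the table and columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_lock (basename TEXT, locked_by TEXT)")
conn.execute("INSERT INTO demo_lock VALUES ('basename1', 'user1')")
tables = list_tables(conn)
columns, rows = dump_table(conn, "demo_lock")
conn.close()
```

The dump_database operation remains the supported way to look at the workspace tables; a script like this is only useful when you want to filter or post-process the output yourself.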
Warning: be very certain
that you apply the force_unlock operation only to basenames whose locks
have been stranded. If you unlock a basename which is being
annotated, the annotator will not be able to save her changes.
More on force_unlock here.
If you get an error message saying that the workspace is in use, and
you're absolutely certain that no one else is working on the
workspace, a previous operation has probably failed in a way that
left the "opLockfile" file in place. More on how to deal with
this here.