Using workspaces

Workspaces provide a guided, structured way of managing and processing your documents. Make sure that this is what you want. Workspace mode is provided by MATWorkspaceEngine on the command line, and via "File -> Open workspace..." in the Web UI. You can find a summary of the highlights about using workspaces here; this document provides the details.

The structure of the workspace directory

Workspaces are just directories. The structure of these directories looks like this:

Workspaces, workflows and workspace configurations

When you invoke MATEngine, the file mode workflow processing tool, you have to provide a good deal of information, and you have to know a good deal about your document:

Workspaces track these things for you, and much more. When you create your workspace, you specify a task, language, and workflow, and from that point on, the workspace will manage the progress of each document through the workflow for you.

When you create a workspace, you can specify the workflow using the --workspace_config option. The value of this option is either a workflow name, or the name of a workspace configuration based on that workflow which customizes some of the operations used in your workspace (see here, here, here, and here). You can define workspace configurations in your task.xml file.

The goal of the workspace is to try to ensure that each document the workspace is processing is always ready for hand annotation. So when you import documents into a workspace, they'll be advanced to the first hand-annotatable step in the workspace's workflow; when you declare that you're done with a step, the document will be advanced to the next hand-annotatable step (if any). All intervening automated processing, including applying the appropriate trainable models, is done for you. The workspace keeps track of the current step the document is in.
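
The advancement rule above can be pictured with a small Python sketch. This is a simplified model, not MAT's implementation: the step names and the idea of flagging each step as hand-annotatable with a boolean are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical workflow: (step name, is the step hand-annotatable?).
# These step names are illustrative, not taken from a real MAT task.
Workflow = List[Tuple[str, bool]]


def next_hand_annotatable(workflow: Workflow, done: int) -> Tuple[List[str], Optional[str]]:
    """Given how many steps are already done, return the automated steps
    to run and the hand-annotatable step to stop at (None if the rest of
    the workflow has no hand-annotatable step)."""
    automated = []
    for name, by_hand in workflow[done:]:
        if by_hand:
            return automated, name
        automated.append(name)
    return automated, None


demo = [("zone", False), ("tokenize", False), ("tag", True)]
print(next_hand_annotatable(demo, 0))  # (['zone', 'tokenize'], 'tag')
print(next_hand_annotatable(demo, 3))  # ([], None)
```

On import, nothing is done yet, so the two automated steps run and the document stops at the hand-annotatable one; once every step is done, there is nowhere left to advance to.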

Document state

In MAT, all documents in workspaces are now closely tracked for their annotation state in the current step. In the current step, documents can be:

Workspace users

The document state tracking in the workspaces includes tracking who modified the annotations in the documents. As a result, every document edit in workspaces is linked to a workspace user.

The inventory of users of a workspace is entirely up to its creators and managers. Every workspace must be created with at least one initial user. The names of these users are not bound to any external resource; they're not required to be the same as login names, for instance. They're merely there to provide a way of attributing document changes. There's no account management or passwords; you can "claim" to be any registered user you want to claim to be when you edit a workspace. We're assuming that you're using MAT workspaces in a cooperative environment in which this sort of inappropriate behavior won't arise.

Although there's no requirement that registered user names correspond to external resources like login names, you may find it easiest to use login names anyway, so that your workspace annotators don't have to remember a different name when they open a workspace.

Documents may be editable by any workspace user, or might be assigned to a particular user. If a document is assigned to someone other than you, you'll be able to view it, but not edit it, in the UI.

Workspace users are assigned roles, which indicate what they can do within the workspace. By default, all users can annotate documents in the core folder. Users may also have an optional 'reviewer' role, which allows them to perform human reviews of other annotators' work and to reconcile documents.

Workspace operations

The available operations are:

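| topic | operation | availability | configurable in workspace configuration | folder |
|---|---|---|---|---|
| creation | create | command line | no | (global) |
| file management | import | command line | yes | (global) |
| file management | remove | command line | no | (global) |
| file management | assign | command line | no | (global) |
| file management | open_file | UI, command line debug | no | (global) |
| file management | markgold | UI, command line debug | no | core |
| file management | unmarkgold | UI, command line debug | no | core |
| file management | save | UI, command line debug | no | core, review, reconciliation |
| inspection | list | UI, command line | no | (global) |
| inspection | workspace_configuration | command line | no | (global) |
| inspection | dump_database | command line | no | (global) |
| logging | enable_logging | command line | no | (global) |
| logging | disable_logging | command line | no | (global) |
| logging | rerun_log | command line | no | (global) |
| users | register_users | command line | no | (global) |
| users | list_users | command line | no | (global) |
| users | add_roles | command line | no | (global) |
| users | remove_roles | command line | no | (global) |
| automated tagging | modelbuild | command line | yes | core |
| automated tagging | advance | UI, command line | yes | core |
| experimentation | list_basename_sets | command line | no | (global) |
| experimentation | add_to_basename_set | command line | no | (global) |
| experimentation | remove_from_basename_set | command line | no | (global) |
| experimentation | run_experiment | command line | no | (global) |
| review and reconciliation | schedule_review | command line | no | (global) |
| review and reconciliation | unschedule_review | command line | no | (global) |
| review and reconciliation | list_review_schedule | command line | no | (global) |
| review and reconciliation | apply_crossvalidation | command line | yes | core |
| review and reconciliation | remove_from_reconciliation | command line | no | reconciliation |
| review and reconciliation | request_review | command line | no | core |
| review and reconciliation | complete_human_review | UI, command line | no | review |
| administration | force_unlock | command line | no | core, review, reconciliation |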

There are also internal operations which are not publicly visible (release_lock, update_ui_log).

We'll review each of these operations in turn.

Creation

create

The create operation creates a workspace. It requires a task and an initial user. If the workspace supports multiple languages, similarity profiles, or workspace configurations, these must be supplied as well.

This operation is available only on the command line.

File management

import

The import operation ingests documents into the workspace. The documents are all converted to MAT JSON format, and are prepared for annotation. You can optionally assign documents to users.

This operation is only available on the command line.

Historically, the import operation could target multiple folders, but as of MAT 2.0, only the core folder is eligible for import.

Configuring the import operation in task.xml

In task.xml, by creating a workspace configuration, you can customize the default by which documents are prepared for annotation when they're imported. The key/value pairs here are the same as the ones available to MATEngine and the steps it executes, with the caveat that workflow and steps are specified for you, and the documents to be imported, and their file types, are specified as options to the import operation itself. Say, for instance, that you wanted to provide additional tokenizer patterns to the tokenizer provided with jCarafe. Here's how you'd do it:

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="import">
        <settings tokenizer_patterns="..."/>
      </operation>
      ...
    </workspace>
  </workspaces>

In addition, the import operation can be augmented using the --workflows and --steps options described in MATWorkspaceEngine.

Finally, note that any advancement after documents are marked gold or reconciled on import is governed by the customizations to the "advance" operation.
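
If it helps to see how the per-operation settings are organized, here is a minimal Python sketch that reads a `<workspaces>` fragment with the standard library's ElementTree and collects the `<settings>` attributes for each operation. It is only a reading aid, not how MAT itself loads task.xml; the `tokenizer_patterns` value `patterns.txt` is a made-up placeholder, and `operation_settings` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A filled-in variant of the <workspaces> fragment; "patterns.txt" is a
# made-up placeholder value, not a real default.
FRAGMENT = """
<workspaces default_config="Demo">
  <workspace workflow="Demo" config_name="Custom Demo">
    <operation name="import">
      <settings tokenizer_patterns="patterns.txt"/>
    </operation>
  </workspace>
</workspaces>
"""


def operation_settings(xml_text, config_name):
    """Map each operation name in the named workspace configuration to a
    dict of its <settings> attributes."""
    root = ET.fromstring(xml_text)
    result = {}
    for ws in root.findall("workspace"):
        if ws.get("config_name") != config_name:
            continue
        for op in ws.findall("operation"):
            merged = {}
            for settings in op.findall("settings"):
                merged.update(settings.attrib)
            result[op.get("name")] = merged
    return result


print(operation_settings(FRAGMENT, "Custom Demo"))
# {'import': {'tokenizer_patterns': 'patterns.txt'}}
```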

remove

The remove operation removes all copies of the basename from the workspace. Warning: this operation will remove all traces of the basenames from the workspace folders and the database. Do not use it unless you really want them removed.

This operation is only available on the command line.

assign

This operation assigns the specified basenames to the specified users. Each user gets his or her own copy of the document to annotate. If there are no available documents corresponding to the basename which haven't already been altered by a human, the basename cannot be assigned.

This operation is only available on the command line.

open_file

This operation opens a workspace file and returns its contents. It also locks the workspace file in the workspace database. This lock is typically released when a file is closed in the UI, using the private release_lock operation. If this document is "stranded" - if, for instance, a user forgets to close the document - you can use the force_unlock operation to fix this.

This operation is available in the MAT UI, or on the command line if --debug is provided.
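
The relationship between open_file, release_lock and force_unlock can be sketched with a toy lock table in Python. This is an illustration under assumptions: the real locks are recorded in the workspace database, and the `LockTable` class, its method signatures and the use of a random hex string as the lock ID are all hypothetical.

```python
import uuid


class LockTable:
    """Toy stand-in for the lock records kept in the workspace database."""

    def __init__(self):
        self._locks = {}  # (folder, basename) -> (lock_id, user)

    def open_file(self, folder, basename, user):
        """Lock the document and hand back a lock ID for later release."""
        key = (folder, basename)
        if key in self._locks:
            holder = self._locks[key][1]
            raise RuntimeError("%s/%s is locked by %s" % (folder, basename, holder))
        lock_id = uuid.uuid4().hex
        self._locks[key] = (lock_id, user)
        return lock_id

    def release_lock(self, lock_id):
        """Normal path: the UI releases the lock it was given."""
        for key, (held_id, _user) in list(self._locks.items()):
            if held_id == lock_id:
                del self._locks[key]
                return True
        return False

    def force_unlock(self, folder, basename):
        """Administrative path: clear a stranded lock without its ID."""
        return self._locks.pop((folder, basename), None) is not None
```

Note that after a forced unlock, the original lock ID no longer matches anything, which mirrors the warning that an annotator whose document is forcibly unlocked will not be able to save her changes.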

markgold

This operation marks all of the "non-gold" segments in a document "human gold" for the current hand annotatable step, and records the step as done. Then, by default, it checks for scheduled reviews; if it finds a scheduled review, it submits the document for review, and if no reviews are found, it advances the document to the next hand-annotatable step. This automatic advancement can be customized through the "advance" operation.

This operation is available in the MAT UI, or indirectly on the command line via the import operation, or on the command line if --debug is provided. When used in the UI, it will trigger a save operation first if the document has unsaved changes. The UI also has an option to mark gold without advancement; you should use this option if you want to request a review.
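
One way to read the default behavior after marking gold is as a three-way decision. The Python sketch below is our interpretation rather than MAT code: `after_markgold`, the shape of `review_schedule` (a mapping from step name to review type) and the returned strings are all hypothetical.

```python
def after_markgold(step, review_schedule, advance=True):
    """Describe what happens to a document once `step` is marked gold.

    review_schedule maps step names to a scheduled review type.
    advance=False models the UI option to mark gold without advancement.
    """
    if step in review_schedule:
        return "submit for %s review" % review_schedule[step]
    if advance:
        return "advance to next hand-annotatable step"
    return "stay in place (a review can still be requested)"


schedule = {"tag": "human"}
print(after_markgold("tag", schedule))
print(after_markgold("tag", {}))
print(after_markgold("tag", {}, advance=False))
```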

unmarkgold

This operation marks all of the "human gold" or "reconciled" segments in a document "non-gold", and marks the step undone for that document.

This operation is available in the MAT UI, or on the command line if --debug is provided. When used in the UI, it will trigger a save operation first if the document has unsaved changes.

save

This operation saves the contents of a workspace file.

This operation is available in the MAT UI, or on the command line if --debug is provided.

Inspection

list

This operation shows you the contents of the folders in the workspace. The listing shows you the status of the document, as well as who it's assigned to.

It is available both on the command line, and in the MAT UI as part of the workspace interface.

workspace_configuration

This operation describes a number of properties of the workspace. The properties reported are:

dump_database

This operation describes all the tables in the workspace database. It is a useful debugging tool for the technically inclined.

This operation is only available on the command line.
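
Because the workspace database is an SQLite database, you can also get a rough listing of its tables and columns outside of MAT. The sketch below uses only Python's standard sqlite3 module; `describe_tables` is a hypothetical helper, and you must supply the path to the database file yourself, since its location inside the workspace is not stated here. Be aware that `sqlite3.connect` creates an empty file if the path doesn't exist, and that you should not run this while an operation is modifying the workspace.

```python
import sqlite3


def describe_tables(db_path):
    """Return {table name: [column names]} for an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        return {name: [col[1] for col in conn.execute('PRAGMA table_info("%s")' % name)]
                for name in names}
    finally:
        conn.close()
```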

Logging

MAT provides a rich and extensive logging infrastructure specifically for workspaces. When logging is enabled, MAT workspace operations log every action and data modification, so that the activities in the workspace can be rerun from the point that logging was enabled, exactly as they were originally performed.

Workspace logging is distinct from UI logging. The MAT UI has the capability of capturing all the user gestures and saving these gestures to a CSV file at the user's request. If workspace logging is enabled, the UI turns on this capability specifically for that workspace. This workspace-specific UI logging capability captures the same information as the general UI logging, but differs from the general UI logging in a number of ways:

| general UI logging | workspace UI logging |
|---|---|
| Enabled and disabled in the UI | Enabled and disabled on the command line along with general workspace logging; no UI controls available |
| Captures UI activity for all windows | Captures UI activity for windows associated with the logging-enabled workspace |
| Saves the log to a CSV file | Saves the log to a JSON file |
| Saves the log in a location determined by the user's browser preferences | Saves the log in the workspace logging directory |
| Saves the log when UI logging is disabled | Saves the log when modified workspace files are saved |
| For each save, records UI activity since the last time UI logging was enabled | For each save, records UI activity since the last log save, or since the workspace was opened in the UI |

If you also choose to enable general UI logging, you'll get all the expected gestures in your general UI log, including those that are captured for workspace logging.

enable_logging

This operation enables logging. The log will be saved in the _checkpoint subdirectory of the workspace directory.

This operation is available on the command line.

disable_logging

This operation disables logging. If a log is being collected, by default it is moved to the first available _checkpoint_<n> path. However, the user can force the log to be disabled if she chooses. In either case, this ensures that _checkpoint never contains a discontinuous log.

This operation is available on the command line.
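
The "first available _checkpoint_<n> path" can be pictured with the following Python sketch. It is not MAT's code; in particular, starting the numbering at 1 is an assumption, and `first_available_checkpoint` is a hypothetical helper.

```python
import os


def first_available_checkpoint(workspace_dir):
    """Return the first _checkpoint_<n> path that does not exist yet.
    Starting the count at 1 is an assumption."""
    n = 1
    while True:
        candidate = os.path.join(workspace_dir, "_checkpoint_%d" % n)
        if not os.path.exists(candidate):
            return candidate
        n += 1
```

Because each disabled log moves to a fresh numbered directory, _checkpoint itself only ever holds the log for the current, uninterrupted logging period.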

rerun_log

This operation allows you to rerun the log. It will use the _checkpoint/_rerun subdirectory of the workspace directory to store the rerun state. You can use this capability to recreate any intermediate state of your workspace, e.g., for experiment analysis.

This operation is available on the command line.

Users

Workspace users have roles which say what they can do in the workspace, but by default, users have only one available role, "annotator", which means the user is eligible to perform annotation. An optional "reviewer" role can be assigned to users, which means they can review or reconcile documents. The role "all" is a shorthand for both roles.

You can explicitly specify user roles when you register the users, or afterward. You may want to vary the available roles for annotators because, e.g., you may want only some of them to participate in particular reconciliation phases; say, you might want only some annotators to be able to perform the decisive human_decision reconciliation step.
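
The role vocabulary is small enough to model in a few lines of Python. This sketch is illustrative only: `expand_roles` and `can_review` are hypothetical helpers, and treating an empty role list as "annotator" is our reading of the default described above.

```python
ROLES = {"annotator", "reviewer"}


def expand_roles(requested=None):
    """Expand requested role names into a set of concrete roles.
    No roles means the default "annotator"; "all" means every role."""
    if not requested:
        return {"annotator"}
    expanded = set()
    for role in requested:
        if role == "all":
            expanded |= ROLES
        elif role in ROLES:
            expanded.add(role)
        else:
            raise ValueError("unknown role: %s" % role)
    return expanded


def can_review(roles):
    """Only users holding the reviewer role may review or reconcile."""
    return "reviewer" in roles
```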

register_users

This operation allows you to add registered users to your workspace. Perhaps you want to be able to track the contributions of multiple annotators, or you might want to actually assign documents to multiple annotators and do multiple annotation. You may also want to assign roles to your users. You cannot unregister users once they're registered, although you can remove all their roles.

This operation is only available on the command line.

list_users

This operation lists the users in a workspace. It is also available as part of the workspace_configuration operation.

This operation is only available on the command line.

add_roles

The add_roles operation adds roles to existing users.

This operation is only available on the command line.

remove_roles

The remove_roles operation removes roles from existing users.

This operation is only available on the command line.

Automated tagging

By default, the workspace will attempt to ensure that each file is positioned at an opportunity for user interaction. When a file is imported, the workspace advances the file to the first hand-annotatable step; when the user marks a document gold in a given step, the workspace attempts to advance to the next hand-annotatable step (assuming no reviews are scheduled). If a model exists for a given step, it will be applied to documents in the appropriate circumstances.

modelbuild

This operation builds a model which can be used to automatically tag other documents. Every document in the workspace which is gold or reconciled for the relevant annotation set is used to build this model. If there are multiple copies of a document because the document is multiply assigned, all copies will be used (so that document will be overrepresented in the model, and all conflicting annotations will be used as well). You can optionally ask the workspace to autotag documents after the model is built.

Note: the workspace model is completely distinct from the default task model.

This operation is only available on the command line.
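
To see why multiply assigned documents end up overrepresented, consider this simplified Python sketch of the selection rule. The record layout and the status strings are hypothetical; MAT tracks gold and reconciled status per annotation set, which this sketch flattens to one status per document copy.

```python
from collections import Counter

# Hypothetical records: (basename, assigned user or None, status).
DOCS = [
    ("doc1", "user1", "gold"),
    ("doc1", "user2", "reconciled"),
    ("doc2", None, "gold"),
    ("doc3", None, "non-gold"),
]


def training_copies(docs):
    """Every copy that is gold or reconciled contributes to the model."""
    return [d for d in docs if d[2] in ("gold", "reconciled")]


def copies_per_basename(docs):
    """Count how many copies of each basename would be used for training."""
    return Counter(basename for basename, _user, _status in training_copies(docs))


print(copies_per_basename(DOCS))  # Counter({'doc1': 2, 'doc2': 1})
```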

Configuring the modelbuild operation in task.xml

In task.xml, by creating a workspace configuration, you can customize your modelbuild operation, e.g., restrict it to just the gold segments. You can use any setting that's available to the training engine.

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="modelbuild">
        <settings partial_training_on_gold_only="yes"/>
      </operation>
      ...
    </workspace>
  </workspaces>

advance

By default, documents advance automatically to the next hand-annotatable step. Several operations permit you to suppress advancement. If you do, you can complete the advancement later using this operation. This operation automatically advances the document to the next hand-annotatable point, or to the end of the workflow if there are no more hand-annotatable points. You can specify individual basenames to process, or process all documents.

Note: this operation does not use the jCarafe tagging server, even in the UI. So the startup cost of the tagging engine is incurred each time this operation is executed. This operation also does not use the default task model, ever; it only uses models constructed using the modelbuild operation.

This operation is available in the MAT UI (for individual documents) and on the command line. When used in the UI, it will trigger a save operation first if the document has unsaved changes.

Configuring the advance operation in task.xml

In task.xml, by creating a workspace configuration, you can customize how automated advancement happens. This customization will apply not just to explicit invocations of the "advance" operation, but also to every operation which automatically advances (e.g., markgold, complete_human_review). The one exception is import, whose initial processing is governed by customizations to the "import" operation; however, any advancement after marking gold or reconciled on import is covered by the customizations here.

The key/value pairs here are the same as the ones available to MATEngine and the steps it executes, with the caveat that workflow, steps, documents and file types are computed for you. Because these customizations apply to all advancements within the workspace's workflow, you should provide all the options you'd want (beyond the initial import processing). If, for instance, your workflow contains a step which accepts the "allow_foo" option, you can specify it here and it will be passed to that step when it's applied, and otherwise ignored:

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="advance">
        <settings allow_foo="yes"/>
      </operation>
      ...
    </workspace>
  </workspaces>

Experimentation

You can use your workspace as a corpus for experiments. You can access this capability via the <workspace_corpora> element for MATExperimentEngine, or you can access it via the workspace engine. You can further subdivide your workspace into basename sets which can be referred to in your experiment.

list_basename_sets

This operation lists the basename sets and their contents. This operation is only available on the command line.

add_to_basename_set

This operation adds basenames to a given basename set (and implicitly creates the set if necessary). This operation is only available on the command line.

remove_from_basename_set

This operation removes basenames from a given basename set (and implicitly removes the set if necessary). This operation is only available on the command line.
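
The implicit creation and removal of basename sets can be sketched in Python as follows. This is a toy model, not MAT's storage; the `BasenameSets` class is hypothetical, and we assume that "if necessary" means the set is removed once it becomes empty.

```python
class BasenameSets:
    """Toy model of named basename sets."""

    def __init__(self):
        self.sets = {}

    def add(self, set_name, basenames):
        # The set is created implicitly the first time it is used.
        self.sets.setdefault(set_name, set()).update(basenames)

    def remove(self, set_name, basenames):
        if set_name not in self.sets:
            return
        self.sets[set_name].difference_update(basenames)
        # Assumption: a set that becomes empty is removed.
        if not self.sets[set_name]:
            del self.sets[set_name]
```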

run_experiment

This operation allows you to run an experiment based on this workspace, either using an experiment file or by specifying the properties of the test set in terms of properties of the workspace basenames. This operation is only available on the command line.

Administration

force_unlock

This operation forces a basename in the named folder to be unlocked. In the reconciliation and review folders, it will advance to the next hand-annotatable step by default.

Warning: be very certain that you apply the force_unlock operation only to basenames whose locks have been stranded. If you unlock a basename which is being annotated, the annotator will not be able to save her changes.

This operation is only available on the command line.

Workspace security

Unlike file mode, workspace mode is stateful from the point of view of the UI. It is the server, rather than the client, which loads and saves the files. However, we don't want just anybody to be able to cause the server to perform these stateful operations, so the MAT web server implements some security mechanisms.

Note, however, that the MAT workspace functionality is not an enterprise-secure implementation, and will never be one. It does not use SSL; it does not perform any sort of user authentication beyond the workspace key; it does not provide any security logging or traceability; and it does not currently implement transactions. You should assume that anyone who has access to your network can see your workspace traffic, and overwrite your data.

Note that workspace users play no role in workspace security.

Workspace locking

Workspaces maintain an internal lock to ensure that any operations which change the state of the workspace are exclusive. This locking mechanism is quite simple - it relies on the presence or absence of the "opLockfile" file. If something goes horribly wrong, it's possible that the workspace may end up in a stranded state, where it fails to remove "opLockfile" at the end of the operation. If you're getting a notification that the workspace is in use, and you're sure it's not, you can remove the file by hand. As an added bonus, the file contents will tell you what operation was being performed by which user, and what time the lock was established.
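
Before removing the file by hand, you may want to look at what it says. The following Python sketch simply reports whether the lock file is present and returns its raw contents; it deliberately does not delete anything. `read_op_lock` is a hypothetical helper, and it assumes that "opLockfile" sits at the top level of the workspace directory.

```python
import os


def read_op_lock(workspace_dir):
    """Return the contents of opLockfile, or None when no lock is present.
    Assumes the file sits at the top level of the workspace directory."""
    path = os.path.join(workspace_dir, "opLockfile")
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        return f.read()
```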

Advanced topic: workspace review and reconciliation

Workspaces support the option of reviewing documents after they're annotated. You can schedule a review in advance, for any document that completes a particular step, or, if there's no existing schedule, you can request an ad-hoc review after you complete a step. Finally, you can use a requested review to repair errors in previous steps.

There are four types of document review:

| review type | target folder | availability | relevant operations | how does it work? |
|---|---|---|---|---|
| human | review | schedule, ad-hoc | schedule_review, unschedule_review, list_review_schedule, request_review, complete_human_review | The document is copied from the core folder to the review folder. An annotator with review privileges (other than the one who last annotated the document in the core folder) reviews the document, and applies the "Save & Done" operation in the UI when satisfied. Once the review document is complete, it is copied back to the core folder and marked reconciled for the step just completed. |
| reconciliation | reconciliation | schedule | schedule_review, unschedule_review, list_review_schedule, remove_from_reconciliation | This review is intended to be used when documents are multiply assigned. When this type of review is scheduled for a step and an annotator completes that step, the document is placed in a "suspended" state until all the versions of this document have completed the relevant step. At that point, a reconciliation document is created and inserted into the reconciliation folder, and an annotator with review privileges reconciles the conflicts in the reconciliation document, and applies the "Save & Done" operation, which closes the completely reconciled document. Once the reconciliation document is closed, it is converted back into a normal document and copied back into the core folder, replacing the documents which were submitted for the review. These now-reviewed documents are marked reconciled for the step just completed. At this point, all the copies of the document will be identical. |
| reconciliation_with_crossvalidation | reconciliation | schedule, ad-hoc | schedule_review, unschedule_review, list_review_schedule, apply_crossvalidation, remove_from_reconciliation, request_review | This review is like reconciliation review, except that it can be used with single assignment, or no assignment at all; in fact, when it's an ad-hoc request, there must be only one copy of the document in the workspace. When this type of review is triggered by a schedule, or requested ad-hoc, its "suspended" state also involves awaiting cross-validation. Once the user is satisfied that enough documents have accumulated to do cross-validation, she calls the apply_crossvalidation operation, which creates another copy of the document, based on cross-validation-trained models. This additional document copy is added to the reconciliation document, and the review proceeds as above. |
| repair | review | ad-hoc | request_review, complete_human_review | This review is a special kind of human review, in which the reviewing user does not require the reviewer role; is the same as the last person who touched the document; and does not mark the document reconciled when it's completed. It's intended for special situations where you've made a mistake in a previous workspace step (which you can't return to). |

schedule_review

This operation allows you to schedule a review. This operation is only available on the command line.

unschedule_review

This operation allows you to remove a scheduled review. This operation is only available on the command line.

list_review_schedule

This operation will list the scheduled reviews, by step. This operation is only available on the command line.

apply_crossvalidation

Use this operation to apply crossvalidation to accumulated documents which are waiting for it. In general, you should allow a reasonable number of documents to accumulate awaiting crossvalidation before you trigger it, since otherwise, it'll essentially do the same thing that autotagging does.

This operation is only available on the command line.

Configuring the apply_crossvalidation operation in task.xml

In task.xml, by creating a workspace configuration, you can customize the crossvalidation defaults. Here's how you'd do it:

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="apply_crossvalidation">
        <settings folds="..."/>
      </operation>
      ...
    </workspace>
  </workspaces>

remove_from_reconciliation

If, for some reason, a document fails to exit reconciliation naturally (if some of the users fail to complete their reconciliation steps, for example), you can use this operation to remove the document forcibly from reconciliation. You have the option of discarding the reconciliation decisions that were made. By default, this operation will advance the document to the next hand-annotatable step. This operation is only available on the command line.

request_review

If the current step isn't scheduled for review or reconciliation, you can request a review yourself, if you want one. Only human review and reconciliation with crossvalidation are available; you can't request a review for a document assigned to someone else.

The 'repair' review type is special; it's equivalent to requesting a human review which you'll conduct yourself, on a document which isn't complete in its current step.

This operation is only available on the command line.

complete_human_review

If a document is in the human review folder, you can indicate that you're satisfied with the document with this operation. If the document isn't being reviewed for repair, this operation will mark the document reconciled for the current step, and then advance the document to the next hand-annotatable step. This operation is available on the command line, and in the UI via the "Save & Done" operation in the review folder.

Advanced topic: the workspace database

The workspace database is an SQLite database which tracks the status of documents, users, and the workspace itself. The schema can be found in MAT_PKG_HOME/lib/mat/python/MAT/ws_db.sql. The tables are:

Troubleshooting

Failed import

You may realize, once you've completed an import operation, that you didn't import the basenames the way you'd wanted; perhaps you'd intended to strip a suffix, or you assigned them to the wrong workspace user. You can use the remove operation to remove the basenames from the workspace in preparation for re-importing. Warning: this operation will remove all traces of the basenames from the workspace folders and the database. Do not use it unless you really want them removed.

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove basename1...

If you're not sure what basenames are available, the --help option will list them:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove --help

More on the remove operation here.

Locked files

The workspaces do not permit documents to be edited by more than one annotator at a time. The workspaces achieve this exclusivity through the use of file locks, which are recorded in the workspace database. When an annotator opens a document for annotation, the annotation UI is given a lock ID which it can use to release the document when the editing session is over. In some circumstances, unfortunately, the document is not unlocked; for instance, if the UI encounters an unexpected error and crashes before unlocking the document. You can use the force_unlock operation to clear this lock from the database.

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> force_unlock --user user1 core basename1

If you just want to unlock everything, don't specify any basenames. If you want to know what's locked, use the dump_database operation:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> dump_database

This will show you the content of the workspace database tables.

Warning: be very certain that you apply the force_unlock operation only to basenames whose locks have been stranded. If you unlock a basename which is being annotated, the annotator will not be able to save her changes.

More on force_unlock here.

Error "workspace is currently unavailable (processing another request)"

If you get this error message, and you're absolutely certain that no one else is working on the workspace, then a previous operation has failed in such a way that it did not remove the "opLockfile" file. More on how to deal with this here.