Using workspaces

Workspaces provide a guided, structured way of managing and processing your documents. Make sure that this is what you want. Workspace mode is provided by MATWorkspaceEngine on the command line, and via "File -> Open workspace..." in the Web UI. You can find a summary of the highlights about using workspaces here; this document provides the details.

The structure of the workspace directory
Workspaces, workflows and workspace configurations
Document state
Workspace users
Workspace operations

Creation
File management
Inspection
Logging
Users
Automated tagging
Experimentation
Review and reconciliation
Administration

Workspace security
Workspace locking
Advanced topic: workspace review and reconciliation
Advanced topic: the workspace database
Troubleshooting

The structure of the workspace directory

Workspaces are just directories. The structure of these directories looks like this:

folders (dir) - a directory containing the workspace folders, which are themselves directories. This directory will contain, at least:

  core (dir)
  review (dir)
  reconciliation (dir)
  export (dir) (not yet used)

models (dir) - a directory containing any models that are built during workspace operations

  model_<language>_<step> (file) - the most recently generated jCarafe model for the step (may not exist)
  model_<language>_<step>_basenames (file) - a file containing a list of the basenames used to create the most recent model for the step

opLockfile (file) - if present, the workspace is locked.
_checkpoint (dir) - if logging is enabled, contains the log
experiments (dir) - if experiments have been run, the output of the experiments

  <date>_<secs>_<msecs> (dir) - an experiment, which has the directory structure of an experiment

ws_db.db (file) - the workspace SQLite database, which contains metadata about the documents and the workspace itself
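
A quick way to check a directory against the layout above is to probe for the documented entries. The following is a minimal Python sketch using only the standard library; the function name describe_workspace and the shape of the returned dictionary are illustrative assumptions, not part of MAT.

```python
from pathlib import Path

# Folder names come from the listing above.
WORKSPACE_FOLDERS = ("core", "review", "reconciliation", "export")


def describe_workspace(ws_dir):
    """Report which of the documented workspace entries exist under ws_dir."""
    root = Path(ws_dir)
    experiments_dir = root / "experiments"
    experiments = []
    if experiments_dir.is_dir():
        # Each experiment is a subdirectory named <date>_<secs>_<msecs>.
        experiments = sorted(p.name for p in experiments_dir.iterdir() if p.is_dir())
    return {
        "folders": {name: (root / "folders" / name).is_dir() for name in WORKSPACE_FOLDERS},
        "has_models_dir": (root / "models").is_dir(),
        "has_database": (root / "ws_db.db").is_file(),
        "locked": (root / "opLockfile").exists(),
        "logging_dir_present": (root / "_checkpoint").is_dir(),
        "experiments": experiments,
    }
```

The function only reads the file system and never modifies the workspace, so it should be safe to run while others are working.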

Workspaces, workflows and workspace configurations

When you invoke MATEngine, the file mode workflow processing tool, you have to provide a good deal of information, and you have to know a good deal about your document:

You must specify the task, workflow and steps you want to apply.
You may need to specify the language, if the task supports multiple languages.
You will have to know what's already been done to your document, and remember what you're doing to it.

Workspaces track these things for you, and much more. When you create your workspace, you specify a task, language, and workflow, and from that point on, the workspace will manage the progress of each document through the workflow for you.

When you create a workspace, you can specify the workflow using the --workspace_config option. The value of this option is either a workflow name, or the name of a workspace configuration based on that workflow which customizes some of the operations used in your workspace (see the sections on configuring the import, modelbuild, advance, and apply_crossvalidation operations). You can define workspace configurations in your task.xml file.

The goal of the workspace is to try to ensure that each document the workspace is processing is always ready for hand annotation. So when you import documents into a workspace, they'll be advanced to the first hand-annotatable step in the workspace's workflow; when you declare that you're done with a step, the document will be advanced to the next hand-annotatable step (if any). All intervening automated processing, including applying the appropriate trainable models, is done for you. The workspace keeps track of the current step the document is in.
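
As a rough illustration of the advancement rule just described, the following Python sketch walks a workflow until it reaches a hand-annotatable step. The (name, hand_annotatable) representation and the function name advance_document are hypothetical, chosen only for this example; this is not MAT's actual implementation.

```python
def advance_document(steps, done_count):
    """Find where a document stops after automated advancement.

    steps: ordered list of (step_name, hand_annotatable) pairs.
    done_count: number of leading steps already completed for the document.
    Returns (automated_steps_applied, stopping_step); stopping_step is None
    when the end of the workflow is reached.
    """
    applied = []
    for name, hand_annotatable in steps[done_count:]:
        if hand_annotatable:
            # The document waits here for a human annotator.
            return applied, name
        # Automated steps are applied without user involvement.
        applied.append(name)
    return applied, None
```

On import, done_count would be 0; after a step is declared done, it would be the index just past that step.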

Document state

In MAT, all documents in workspaces are now closely tracked for their annotation state in the current step. In the current step, documents can be:

unannotated, which means that no annotations for the step have been added
uncorrected, which means that the document has been automatically annotated for the step, but no corrections have been made
partially corrected, which means that a human annotator has modified the annotations for the step, but hasn't marked them gold
gold, which means that a human annotator has judged the annotations to be complete for the step
reconciled, which means the completed annotations for the step have undergone some sort of review

Workspace users

The document state tracking in the workspaces includes tracking who modified the annotations in the documents. As a result, every document edit in workspaces is linked to a workspace user.

The inventory of users of a workspace is entirely up to its creators and managers. Every workspace must be created with at least one initial user. The names of these users are not bound to any external resource; they're not required to be the same as login names, for instance. They're merely there to provide a way of attributing document changes. There's no account management or passwords; you can "claim" to be any registered user you want to claim to be when you edit a workspace. We're assuming that you're using MAT workspaces in a cooperative environment in which this sort of inappropriate behavior won't arise.

Although there's no requirement that registered user names correspond to external resources like login names, you may find it easiest to use login names anyway, so that your workspace annotators don't have to remember a different name when they open a workspace.

Documents may be editable by any workspace user, or might be assigned to a particular user. If a document is assigned to someone other than you, you'll be able to view it, but not edit it, in the UI.

Workspace users are assigned roles, which indicate what they can do within the workspace. By default, all users can annotate documents in the core folder. Users may also have an optional 'reviewer' role, which allows them to perform human reviews of other annotators' work and to reconcile documents.

Workspace operations

The available operations are:

topic	operation	availability	configurable in workspace configuration	folder
creation	create	command line	no	(global)
file management	import	command line	yes	(global)
	remove	command line	no	(global)
	assign	command line	no	(global)
	open_file	UI, command line debug	no	(global)
	markgold	UI, command line debug	no	core
	unmarkgold	UI, command line debug	no	core
	save	UI, command line debug	no	core, review, reconciliation
inspection	list	UI, command line	no	(global)
	workspace_configuration	command line	no	(global)
	dump_database	command line	no	(global)
logging	enable_logging	command line	no	(global)
	disable_logging	command line	no	(global)
	rerun_log	command line	no	(global)
users	register_users	command line	no	(global)
	list_users	command line	no	(global)
	add_roles	command line	no	(global)
	remove_roles	command line	no	(global)
automated tagging	modelbuild	command line	yes	core
	advance	UI, command line	yes	core
experimentation	list_basename_sets	command line	no	(global)
	add_to_basename_set	command line	no	(global)
	remove_from_basename_set	command line	no	(global)
	run_experiment	command line	no	(global)
review and reconciliation	schedule_review	command line	no	(global)
	unschedule_review	command line	no	(global)
	list_review_schedule	command line	no	(global)
	apply_crossvalidation	command line	yes	core
	remove_from_reconciliation	command line	no	reconciliation
	request_review	command line	no	core
	complete_human_review	UI, command line	no	review
administration	force_unlock	command line	no	core, review, reconciliation

There are also internal operations which are not publicly visible (release_lock, update_ui_log).

We'll review each of these operations in turn.

Creation

create

The create operation creates a workspace. It requires a task and an initial user. If the workspace supports multiple languages, similarity profiles, or workspace configurations, these must be supplied as well.

This operation is available only on the command line.

File management

import

The import operation ingests documents into the workspace. The documents are all converted to MAT JSON format, and are prepared for annotation. You can optionally assign documents to users.

This operation is only available on the command line.

Historically, the import operation could target multiple folders, but as of MAT 2.0, only the core folder is eligible for import.

Configuring the import operation in task.xml

In task.xml, by creating a workspace configuration, you can customize the default by which documents are prepared for annotation when they're imported. The key/value pairs here are the same as the ones available to MATEngine and the steps it executes, with the caveat that workflow and steps are specified for you, and the documents to be imported, and their file types, are specified as options to the import operation itself. Say, for instance, that you wanted to provide additional tokenizer patterns to the tokenizer provided with jCarafe. Here's how you'd do it:

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="import">
        <settings tokenizer_patterns="..."/>
      </operation>
      ...
    </workspace>
  </workspaces>
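
If you want to see which operations a workspace configuration customizes, you can read the fragment with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the fragment above (the "..." placeholder lines are dropped, and the elided attribute value is kept as-is). The function name operation_settings is an assumption for illustration, and a real task.xml will contain more than this fragment.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment above; the attribute value stays elided.
SAMPLE = """
<workspaces default_config="Demo">
  <workspace workflow="Demo" config_name="Custom Demo">
    <operation name="import">
      <settings tokenizer_patterns="..."/>
    </operation>
  </workspace>
</workspaces>
"""


def operation_settings(xml_text):
    """Map each config_name to {operation name: settings attributes}."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() also finds <workspace> elements nested deeper in a larger file.
    for ws in root.iter("workspace"):
        ops = {}
        for op in ws.findall("operation"):
            settings = {}
            for s in op.findall("settings"):
                settings.update(s.attrib)
            ops[op.get("name")] = settings
        result[ws.get("config_name")] = ops
    return result
```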

In addition, the import operation can be augmented using the --workflows and --steps options described in MATWorkspaceEngine.

Finally, note that any advancement after documents are marked gold or reconciled on import is governed by the customizations to the "advance" operation.

remove

The remove operation removes all copies of the specified basenames from the workspace. Warning: this operation will remove all traces of the basenames from the workspace folders and the database. Do not use it unless you really want them removed.

This operation is only available on the command line.

assign

This operation assigns the specified basenames to the specified users. Each user gets his or her own copy of the document to annotate. If there are no available documents corresponding to the basename which haven't already been altered by a human, the basename cannot be assigned.

This operation is only available on the command line.

open_file

This operation opens a workspace file and returns its contents. It also locks the workspace file in the workspace database. This lock is typically released when a file is closed in the UI, using the private release_lock operation. If this document is "stranded" - if, for instance, a user forgets to close the document - you can use the force_unlock operation to fix this.

This operation is available in the MAT UI, or on the command line if --debug is provided.

markgold

This operation marks all of the "non-gold" segments in a document "human gold" for the current hand-annotatable step, and records the step as done. Then, by default, it checks for scheduled reviews; if it finds a scheduled review, it submits the document for review, and if no reviews are found, it advances the document to the next hand-annotatable step. This automatic advancement can be customized through the "advance" operation.

This operation is available in the MAT UI, or indirectly on the command line via the import operation, or on the command line if --debug is provided. When used in the UI, it will trigger a save operation first if the document has unsaved changes. The UI also has an option to mark gold without advancement; you should use this option if you want to request a review.

unmarkgold

This operation marks all of the "human gold" or "reconciled" segments in a document "non-gold", and marks the step undone for that document.
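
The segment relabeling performed by markgold and unmarkgold, and the default follow-up after markgold, can be sketched in Python as follows. Representing a document as a plain list of segment status labels is an assumption made for this example, as are the function names; the real operations work on MAT documents and the workspace database.

```python
# Segment status labels as quoted in the text; the list-of-labels document
# model is an assumption of this sketch.
NON_GOLD, HUMAN_GOLD, RECONCILED = "non-gold", "human gold", "reconciled"


def markgold(segments):
    """Relabel every non-gold segment as human gold; leave the rest alone."""
    return [HUMAN_GOLD if s == NON_GOLD else s for s in segments]


def unmarkgold(segments):
    """Relabel every human gold or reconciled segment as non-gold."""
    return [NON_GOLD if s in (HUMAN_GOLD, RECONCILED) else s for s in segments]


def default_followup(step, scheduled_review_steps):
    """Default behaviour after markgold: review if scheduled, else advance."""
    return "submit_for_review" if step in scheduled_review_steps else "advance"
```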

This operation is available in the MAT UI, or on the command line if --debug is provided. When used in the UI, it will trigger a save operation first if the document has unsaved changes.

save

This operation saves the contents of a workspace file.

This operation is available in the MAT UI, or on the command line if --debug is provided.

Inspection

list

This operation shows you the contents of the folders in the workspace. The listing shows you the status of the document, as well as who it's assigned to.

It is available both on the command line, and in the MAT UI as part of the workspace interface.

workspace_configuration

This operation describes a number of properties of the workspace. The properties reported are:

Task: the name of the task that the workspace uses

Users: the workspace users that are registered

Workflow: the workflow and workspace configuration that the workspace relies on

Language: the language of the workspace

Logging: whether or not workspace logging is enabled

Prioritization: in a future release, MAT may support prioritization queues, to enable techniques such as active learning. This capability is currently disabled.

dump_database

This operation describes all the tables in the workspace database. It is a useful debugging tool for the technically inclined.

This operation is only available on the command line.

Logging

MAT provides a rich and extensive logging infrastructure specifically for workspaces. When logging is enabled, MAT workspace operations log every action and data modification, so that the activities in the workspace can be rerun from the point that logging was enabled, exactly as they were originally performed.

Workspace logging is distinct from UI logging. The MAT UI has the capability of capturing all the user gestures, and saving these gestures to a CSV file at the user's request. If workspace logging is enabled, the UI turns on this capability specifically for that workspace. This workspace-specific UI logging capability captures the same information as the general UI logging, but differs from the general UI logging in a number of ways:

general UI logging	workspace UI logging
Enabled and disabled in the UI	Enabled and disabled on the command line along with general workspace logging; no UI controls available
Captures UI activity for all windows	Captures UI activity for windows associated with the logging-enabled workspace
Saves the log to a CSV file	Saves the log to a JSON file
Saves the log in a location determined by the user's browser preferences	Saves the log in the workspace logging directory
Saves the log when UI logging is disabled	Saves the log when modified workspace files are saved
For each save, records UI activity since the last time UI logging was enabled	For each save, records UI activity since the last log save, or since the workspace was opened in the UI

If you also choose to enable general UI logging, you'll get all the expected gestures in your general UI log, including those that are captured for workspace logging.

enable_logging

This operation enables the logging. The log will be saved in the _checkpoint subdirectory of the workspace directory.

This operation is available on the command line.

disable_logging

This operation disables logging. If a log is being collected, by default it is moved to the first available _checkpoint_<n> path. However, the user can force the log to be disabled if she chooses. In either case, this ensures that _checkpoint never contains a discontinuous log.
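
The "first available _checkpoint_<n> path" rule can be pictured with a short Python sketch. The starting value of n (1 here) and the function name are assumptions; check an actual workspace to see how MAT numbers these directories.

```python
from itertools import count
from pathlib import Path


def first_available_checkpoint(ws_dir, start=1):
    """Return the first _checkpoint_<n> path under ws_dir that does not exist.

    start is an assumed initial value for n.
    """
    root = Path(ws_dir)
    for n in count(start):
        candidate = root / f"_checkpoint_{n}"
        if not candidate.exists():
            return candidate
```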

This operation is available on the command line.

rerun_log

This operation allows you to rerun the log. It will use the _checkpoint/_rerun subdirectory of the workspace directory to store the rerun state. You can use this capability to recreate any intermediate state of your workspace, e.g., for experiment analysis.

This operation is available on the command line.

Users

Workspace users have roles which say what they can do in the workspace, but by default, users have only one available role, "annotator", which means the user is eligible to perform annotation. An optional "reviewer" role can be assigned to users, which means they can review or reconcile documents. The role "all" is a shorthand for both roles.

You can explicitly specify user roles when you register the users, or afterward. You may want to vary the available roles for annotators because, e.g., you may want only some of them to participate in particular reconciliation phases; say, you might want only some annotators to be able to perform the decisive human_decision reconciliation step.
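
The role rules above (a default "annotator" role, an optional "reviewer" role, and "all" as shorthand for both) can be modeled in a few lines of Python. This is only a sketch of the rules as stated; the function name expand_roles is hypothetical and does not correspond to a MAT API.

```python
KNOWN_ROLES = {"annotator", "reviewer"}


def expand_roles(requested=None):
    """Turn a list of requested role names into a concrete set of roles."""
    if not requested:
        # With nothing specified, a user gets the default annotator role.
        return {"annotator"}
    roles = set()
    for role in requested:
        if role == "all":
            # "all" is shorthand for both roles.
            roles |= KNOWN_ROLES
        elif role in KNOWN_ROLES:
            roles.add(role)
        else:
            raise ValueError(f"unknown role: {role}")
    return roles
```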

register_users

This operation allows you to add registered users to your workspace. Perhaps you want to be able to track the contributions of multiple annotators, or you might want to actually assign documents to multiple annotators and do multiple annotation. You may also want to assign roles to your users. You cannot unregister users once they're registered, although you can remove all their roles.

This operation is only available on the command line.

list_users

This operation lists the users in a workspace. It is also available as part of the workspace_configuration operation.

This operation is only available on the command line.

add_roles

The add_roles operation adds roles to existing users.

This operation is only available on the command line.

remove_roles

The remove_roles operation removes roles from existing users.

This operation is only available on the command line.

Automated tagging

By default, the workspace will attempt to ensure that each file is positioned at an opportunity for user interaction. When a file is imported, the workspace advances the file to the first hand-annotatable step; when the user marks a document gold in a given step, the workspace attempts to advance to the next hand-annotatable step (assuming no reviews are scheduled). If a model exists for a given step, it will be applied to documents in the appropriate circumstances.

modelbuild

This operation builds a model which can be used to automatically tag other documents. Every document in the workspace which is gold or reconciled for the relevant annotation set is used to build this model. If there are multiple copies of a document because the document is multiply assigned, all copies will be used (so that document will be overrepresented in the model, and all conflicting annotations will be used as well). You can optionally ask the workspace to autotag documents after the model is built.
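
The selection rule described above can be sketched in Python. The DocState enum mirrors the document states listed earlier in this document; the (basename, assigned_user, state) tuples and the function name are assumptions made for illustration.

```python
from enum import Enum


class DocState(Enum):
    UNANNOTATED = "unannotated"
    UNCORRECTED = "uncorrected"
    PARTIALLY_CORRECTED = "partially corrected"
    GOLD = "gold"
    RECONCILED = "reconciled"


def training_documents(docs):
    """Keep every copy that is gold or reconciled.

    docs: iterable of (basename, assigned_user, state) tuples.
    Multiply assigned basenames are deliberately not deduplicated, which
    matches the overrepresentation described in the text.
    """
    usable = (DocState.GOLD, DocState.RECONCILED)
    return [doc for doc in docs if doc[2] in usable]
```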

Note: the workspace model is completely distinct from the default task model.

This operation is only available on the command line.

Configuring the modelbuild operation in task.xml

In task.xml, by creating a workspace configuration, you can customize your modelbuild operation, e.g., restrict it to just the gold segments. You can use any setting that's available to the training engine.

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="modelbuild">
        <settings partial_training_on_gold_only="yes"/>
      </operation>
      ...
    </workspace>
  </workspaces>

advance

By default, documents advance automatically to the next hand-annotatable step. Several operations permit you to suppress advancement. If you do, you can complete the advancement later using this operation. This operation automatically advances the document to the next hand-annotatable point, or to the end of the workflow if there are no more hand-annotatable points. You can specify individual basenames to process, or process all documents.

Note: this operation does not use the jCarafe tagging server, even in the UI. So the startup cost of the tagging engine is incurred each time this operation is executed. This operation also does not use the default task model, ever; it only uses models constructed using the modelbuild operation.

This operation is available in the MAT UI (for individual documents) and on the command line. When used in the UI, it will trigger a save operation first if the document has unsaved changes.

Configuring the advance operation in task.xml

In task.xml, by creating a workspace configuration, you can customize how automated advancement happens. This customization will apply not just to explicit invocations of the "advance" operation, but also to every operation which automatically advances (e.g., markgold, complete_human_review). The one exception is import, whose initial processing is governed by customizations to the "import" operation; however, any advancement after marking gold or reconciled on import is covered by the customizations here.

The key/value pairs here are the same as the ones available to MATEngine and the steps it executes, with the caveat that workflow, steps, documents and file types are computed for you. Because these customizations apply to all advancements within the workspace's workflow, you should provide all the options you'd want (beyond the initial import processing). If, for instance, your workflow contains a step which accepts the "allow_foo" option, you can specify it here and it will be passed to that step when it's applied, and otherwise ignored:

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="advance">
        <settings allow_foo="yes"/>
      </operation>
      ...
    </workspace>
  </workspaces>

Experimentation

You can use your workspace as a corpus for experiments. You can access this capability via the <workspace_corpora> element for MATExperimentEngine, or you can access it via the workspace engine. You can further subdivide your workspace into basename sets which can be referred to in your experiment.

list_basename_sets

This operation lists the basename sets and their contents. This operation is only available on the command line.

add_to_basename_set

This operation adds basenames to a given basename set (and implicitly creates the set if necessary). This operation is only available on the command line.

remove_from_basename_set

This operation removes basenames from a given basename set (and implicitly removes the set if necessary). This operation is only available on the command line.

run_experiment

This operation allows you to run an experiment based on this workspace, either using an experiment file or by specifying the properties of the test set in terms of properties of the workspace basenames. This operation is only available on the command line.

Administration

force_unlock

This operation forces a basename in the named folder to be unlocked. In the reconciliation and review folders, it will advance to the next hand-annotatable step by default.

Warning: be very certain that you apply the force_unlock operation only to basenames whose locks have been stranded. If you unlock a basename which is being annotated, the annotator will not be able to save her changes.

This operation is only available on the command line.

Workspace security

Unlike file mode, workspace mode is stateful from the point of view of the UI. It is the server, rather than the client, which loads and saves the files. However, we don't want just anybody to be able to cause the server to perform these stateful operations, so the MAT web server implements some security mechanisms.

Note, however, that the MAT workspace functionality is not an enterprise-secure implementation, and will never be one. It does not use SSL; it does not perform any sort of user authentication beyond the workspace key; it does not provide any security logging or traceability; and it does not currently implement transactions. You should assume that anyone who has access to your network can see your workspace traffic, and overwrite your data.

Note that workspace users play no role in workspace security.

Workspace locking

Workspaces maintain an internal lock to ensure that any operations which change the state of the workspace are exclusive. This locking mechanism is quite simple - it relies on the presence or absence of the "opLockfile" file. If something goes horribly wrong, it's possible that the workspace may be left in a stranded state, where it fails to remove "opLockfile" at the end of the operation. If you're getting a notification that the workspace is in use, and you're sure it's not, you can remove the file by hand. As an added bonus, the file contents will tell you what operation was being performed by which user, and at what time the lock was established.
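
Before removing the lock file by hand, it can help to look at what it says. The Python sketch below checks for "opLockfile" and returns its raw contents; the internal format of the file is not described here, so the sketch makes no attempt to parse it, and the function name lock_status is an assumption.

```python
from pathlib import Path


def lock_status(ws_dir):
    """Return the raw contents of opLockfile, or None if the workspace is unlocked."""
    lock = Path(ws_dir) / "opLockfile"
    if not lock.exists():
        return None
    # Read defensively: the file format is not assumed, so undecodable
    # bytes are replaced rather than raising an error.
    return lock.read_text(errors="replace")
```

This only reads the file. Deleting it remains a manual decision, to be made once you're sure no operation is actually in progress.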

Advanced topic: workspace review and reconciliation

Workspaces support the option of reviewing documents after they're annotated. You can schedule a review in advance, for any document that completes a particular step, or, if there's no existing schedule, you can request an ad-hoc review after you complete a step. Finally, you can use a requested review to repair errors in previous steps.

There are four types of document review:

review type	target folder	availability	relevant operations	how does it work?
human	review	schedule, ad-hoc	schedule_review, unschedule_review, list_review_schedule, request_review, complete_human_review	The document is copied from the core folder to the review folder. An annotator with review privileges (other than the one who last annotated the document in the core folder) reviews the document, and applies the "Save & Done" operation in the UI when satisfied. Once the review document is complete, it is copied back to the core folder and marked reconciled for the step just completed.
reconciliation	reconciliation	schedule	schedule_review, unschedule_review, list_review_schedule, remove_from_reconciliation	This review is intended to be used when documents are multiply assigned. When this type of review is scheduled for a step and an annotator completes that step, the document is placed in a "suspended" state until all the versions of this document have completed the relevant step. At that point, a reconciliation document is created and inserted into the reconciliation folder, and an annotator with review privileges reconciles the conflicts in the reconciliation document, and applies the "Save & Done" operation, which closes the completely reconciled document. Once the reconciliation document is closed, it is converted back into a normal document and copied back into the core folder, replacing the documents which were submitted for the review. These now-reviewed documents are marked reconciled for the step just completed. At this point, all the copies of the document will be identical.
reconciliation_with_crossvalidation	reconciliation	schedule, ad-hoc	schedule_review, unschedule_review, list_review_schedule, apply_crossvalidation, remove_from_reconciliation, request_review	This review is like reconciliation review, except that it can be used with single assignment, or no assignment at all; in fact, when it's an ad-hoc request, there must only be one copy of the document in the workspace. When this type of review is triggered by a schedule, or requested ad-hoc, its "suspended" state also involves awaiting cross-validation. Once the user is satisfied that enough documents have accumulated to do cross-validation, she calls the apply_crossvalidation operation, which creates another copy of the document, based on cross-validation-trained models. This additional document copy is added to the reconciliation document, and the review proceeds as above.
repair	review	ad-hoc	request_review, complete_human_review	This review is a special kind of human review: the reviewing user does not need the reviewer role, is the same person who last touched the document, and does not mark the document reconciled when the review is completed. It's intended for special situations where you've made a mistake in a previous workspace step (which you can't return to).
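
The reviewer requirements in the table can be summarized as a small predicate. This Python sketch encodes one reading of the table (the reviewer role is needed for everything except repair; human review excludes the last annotator; repair is done by the last annotator); the function name and argument layout are hypothetical.

```python
def may_conduct_review(user, roles, last_annotator, review_type):
    """Decide whether user may conduct a review of the given type.

    roles: set of role names held by user.
    last_annotator: the user who last annotated the document in the core folder.
    """
    if review_type == "repair":
        # Repair needs no reviewer role, but is done by the last annotator.
        return user == last_annotator
    if review_type == "human":
        # Human review needs the reviewer role and a different person.
        return "reviewer" in roles and user != last_annotator
    if review_type in ("reconciliation", "reconciliation_with_crossvalidation"):
        return "reviewer" in roles
    raise ValueError(f"unknown review type: {review_type}")
```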

schedule_review

This operation allows you to schedule a review. This operation is only available on the command line.

unschedule_review

This operation allows you to remove a scheduled review. This operation is only available on the command line.

list_review_schedule

This operation will list the scheduled reviews, by step. This operation is only available on the command line.

apply_crossvalidation

Use this operation to apply crossvalidation to accumulated documents which are waiting for it. In general, you should allow a reasonable number of documents to accumulate awaiting crossvalidation before you trigger it, since otherwise, it'll essentially do the same thing that autotagging does.

This operation is only available on the command line.

Configuring the apply_crossvalidation operation in task.xml

In task.xml, by creating a workspace configuration, you can customize the crossvalidation defaults. Here's how you'd do it:

  <workspaces default_config="Demo">
    ...
    <workspace workflow="Demo" config_name="Custom Demo">
      ...
      <operation name="apply_crossvalidation">
        <settings folds="..."/>
      </operation>
      ...
    </workspace>
  </workspaces>

remove_from_reconciliation

If, for some reason, a document fails to exit reconciliation naturally (if some of the users fail to complete their reconciliation steps, for example), you can use this operation to remove the document forcibly from reconciliation. You have the option of discarding the reconciliation decisions that were made. By default, this operation will advance the document to the next hand-annotatable step. This operation is only available on the command line.

request_review

If the current step isn't scheduled for review or reconciliation, you can request a review yourself, if you want one. Only human review and reconciliation with crossvalidation are available; you can't request a review for a document assigned to someone else.

The 'repair' review type is special; it's equivalent to requesting a human review which you'll conduct yourself, on a document which isn't complete in its current step.

This operation is only available on the command line.

complete_human_review

If a document is in the human review folder, you can indicate that you're satisfied with the document with this operation. If the document isn't being reviewed for repair, this operation will mark the document reconciled for the current step, and then advance the document to the next hand-annotatable step. This operation is available on the command line, or in the UI via the "Save & Done" operation in the review folder.

Advanced topic: the workspace database

The workspace database is an SQLite database which tracks the status of documents, users, and the workspace itself. The schema can be found in MAT_PKG_HOME/lib/mat/python/MAT/ws_db.sql. The tables are:

document_info: contains the basenames and document names in the core folder, the user they're assigned to, the transaction ID, the current step, and the document and workflow status
review_schedules: contains the review types and the steps they're scheduled for
review_info: contains the documents currently in the review folder, their transaction ID, and whether they're a repair review
reconciliation_info: contains the documents currently in the reconciliation folder, and their transaction ID
reconciliation_to_core: contains the correspondence between the documents in the reconciliation folder and the documents in the core folder they were built out of
users: lists the users in the workspace
user_roles: the role assignments for the users
workspace_state: specifies the workspace-level metadata, including the task, language, workspace configuration, and the number of retained models
basename_sets: specifies the basename sets and basenames in them
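
Because the database is plain SQLite, you can also inspect it with standard tools. The Python sketch below opens a database file read-only and reports the row count of each table; it uses only the standard sqlite3 module and makes no assumptions about column names. The function name table_row_counts is illustrative, and the dump_database operation remains the supported way to view the tables.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Return {table name: row count} for an SQLite file, opened read-only."""
    # mode=ro prevents accidental writes to the workspace database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # Table names come from sqlite_master and are quoted as identifiers.
        return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in names}
    finally:
        conn.close()
```

For a workspace, db_path would be the ws_db.db file in the workspace directory.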

Troubleshooting

Failed import

You may realize, once you've completed an import operation, that you didn't import the basenames the way you'd wanted; perhaps you'd intended to strip a suffix, or you assigned them to the wrong workspace user. You can use the remove operation to remove the basenames from the workspace in preparation for re-importing. Warning: this operation will remove all traces of the basenames from the workspace folders and the database. Do not use it unless you really want them removed.

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove basename1...

If you're not sure what basenames are available, the --help option will list them:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> remove --help

Locked files

The workspaces do not permit documents to be edited by more than one annotator at a time. The workspaces achieve this exclusivity through the use of file locks, which are recorded in the workspace database. When an annotator opens a document for annotation, the annotation UI is given a lock ID which it can use to release the document when the editing session is over. In some circumstances, unfortunately, the document is not unlocked; for instance, if the UI encounters an unexpected error and crashes before unlocking the document. You can use the force_unlock operation to clear this lock from the database.

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> force_unlock --user user1 core basename1

If you just want to unlock everything, don't specify any basenames. If you want to know what's locked, use the dump_database operation:

% $MAT_PKG_HOME/bin/MATWorkspaceEngine <dir> dump_database

This will show you the content of the workspace database tables.

Error "workspace is currently unavailable (processing another request)"

If you get this error message, and you're absolutely certain that no one else is working on the workspace, something horrible has happened: a previous operation has failed in such a way that the "opLockfile" file was never removed. See "Workspace locking" above for how to deal with this.