While MAT is Unicode-aware, not all Unicode characters are
created equal. Those which aren't characters in a modern written
language are handled inconsistently by language processing tools
(including, possibly, MAT). By default, MAT will warn you about
the presence of these characters. But you can change how MAT treats them.
Migrating MAT to Python 3 will exacerbate a problem with Unicode
that has lurked for a very long time, but has seldom been
encountered because of a set of coincidences which will no longer
arise.
In general, MAT is Unicode-aware. It allows you to specify
language script direction when you list the languages associated
with your task (it's not smart enough to know without being told,
but you can control it), and its native
document format is encoded in UTF-8, which can represent the
full range of Unicode characters. MAT has been successfully used
to annotate documents in a wide range of languages. However,
there's a very subtle problem with Unicode in MAT (and many, many
other language processing tools) which can lead to mysterious
misalignment of annotations. In order to understand this, you need
to know something about how MAT thinks about annotations, and
about how programming languages implement Unicode.
MAT uses character offsets to represent the location of
annotations. Unlike in-line XML, which marks the location of an
element by inserting start and end brackets into the character
sequence (e.g., "<PERSON>George Washington</PERSON>
died"), MAT notes, outside the document signal, the presence of a
PERSON annotation from character 0 to character 17. In order to do
this, the length of strings (and substrings) has to be
consistently computed. For all of modern human language, this
computation is reliable. However, for other characters (e.g.,
musical notation, hieroglyphics), it is not.
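To make the offset idea concrete, here is a small Python sketch (the annotation tuple is purely illustrative, not MAT's actual document format) showing how a standoff annotation records a span by character offsets and how the span is recovered by slicing:

    # A minimal sketch of standoff (offset-based) annotation.
    # The (label, start, end) tuple is illustrative, not MAT's real format.
    signal = "George Washington died"

    # The PERSON annotation covers characters 0 through 17 (end-exclusive).
    annotation = ("PERSON", 0, 17)

    label, start, end = annotation
    assert signal[start:end] == "George Washington"
    assert len("George Washington") == 17

As long as every tool that touches the document computes those lengths the same way, the recovered span is always the intended one.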
The reason for this has to do with Unicode. (If you don't know
anything about Unicode, stop immediately and read Joel
Spolsky's Unicode primer.) There are 2^20 possible Unicode
characters (that's not exactly the right number or terminology,
but it will serve for the current discussion). Of those, the first
2^16 (approximately) are in Unicode plane 0, otherwise known as
the Basic Multilingual Plane (BMP). All of the
characters in any living written language are in the BMP, and
these characters aren't a problem for us; it's all the other
characters that can cause a problem. This is because of how
programming languages implement Unicode.
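The distinction is easy to check in code (a Python sketch; the helper name is ours):

    # A character is in the Basic Multilingual Plane if its code point
    # is below 0x10000, i.e., it fits in a single 16-bit unit.
    def in_bmp(ch):
        return ord(ch) < 0x10000

    assert in_bmp("A")             # U+0041, ASCII
    assert in_bmp("\u2764")        # U+2764, the heart, in the BMP
    assert not in_bmp("\U0001F60A")  # U+1F60A, the smiley, outside the BMP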
The mapping between Unicode characters and byte sequences is
handled by character encodings. UTF-8 and ASCII, which you're
likely familiar with, are character encodings. Some of these
encodings (like ASCII) encode only a subset of Unicode; others
(like UTF-8) encode the whole thing. Because Unicode has 2^20
possible characters, in order to encode the whole thing, you
either need an encoding (e.g., UCS-4) which has enough bytes per
character (4 bytes in the case of UCS-4), or you need a
variable-length encoding, which handles the larger character numbers using
more bytes. UTF-8 is one of these variable-length encodings, but
the one we care about right now is UTF-16.
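For example (a Python sketch; the byte counts are properties of the encodings themselves):

    # Number of bytes each encoding needs for one character.
    # "A" is ASCII, U+2764 (the heart) is in the BMP, U+1F60A (the smiley) is not.
    for ch in ("A", "\u2764", "\U0001F60A"):
        print(
            hex(ord(ch)),
            len(ch.encode("utf-8")),      # 1, 3, 4 bytes: variable length
            len(ch.encode("utf-16-le")),  # 2, 2, 4 bytes: surrogate pair for the last
            len(ch.encode("utf-32-le")),  # 4, 4, 4 bytes: fixed width (UCS-4)
        )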
UTF-16 uses, by default, 2 bytes for each character. This allows
it to handle all of the BMP. The way it handles the rest of the
characters is that it has a special range of 2-byte sequences
(whose Unicode character numbers are specifically reserved for
this purpose). One of these 2-byte sequences can only be interpreted
as a character together with the 2-byte sequence that follows it; the
combination is called a "surrogate pair", and it allows UTF-16 to
represent all the characters outside the BMP. The reason we care
about this is that virtually all programming languages, by default,
implement their strings as UTF-16 sequences. What this means is that, in
these programming languages, the length of the string "I ❤
Unicode" (where the heart is Unicode code point U+2764, in the
BMP) is 11, but the length of "I 😊 Unicode" (where the smiley
face is Unicode code point U+1F60A, outside the BMP) is 12.
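The difference is easy to reproduce (a Python sketch; counting 2-byte units in a UTF-16 encoding stands in for the length a UTF-16-based string implementation would report):

    s_bmp = "I \u2764 Unicode"          # heart, U+2764, inside the BMP
    s_non_bmp = "I \U0001F60A Unicode"  # smiley, U+1F60A, outside the BMP

    # Length measured in UTF-16 code units, i.e., what a UTF-16-based
    # string implementation reports as the string length.
    def utf16_length(s):
        return len(s.encode("utf-16-le")) // 2

    print(utf16_length(s_bmp))      # 11
    print(utf16_length(s_non_bmp))  # 12: the smiley is a surrogate pair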
This becomes a potential problem when multiple programming
languages need to extract substrings from the document. If all the
programming languages involved treat strings consistently with
each other, there's no problem, even if, from the point of view of Unicode
characters, the length of characters outside the BMP is "wrong".
If the programming languages don't treat the strings in
the same way, that's where the problems begin. Here's a sample of
how programming languages handle strings: Java and JavaScript strings
are sequences of UTF-16 code units, so a character outside the BMP
contributes 2 to a string's length; the common "narrow" builds of
Python 2 behave the same way; Python 3 (3.3 and later), by contrast,
represents strings as sequences of Unicode code points, so every
character contributes exactly 1.
So in Python 2, MAT's three core components (jCarafe, the core
MAT library, the Web UI) all make the same Unicode "mistake":
although the length of the Unicode smiley face is 2, it's 2
consistently, and if MAT controlled the document from the
beginning of its processing to the end, the strings would be
handled consistently. In Python 3, on the other hand, this is not
the case, because the core MAT library handles offsets differently
than the jCarafe and Web UI implementations do.
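Here is what that inconsistency looks like in practice (a Python 3 sketch; the offsets are the ones a UTF-16-based component, such as a Java or JavaScript implementation, would compute for the same span):

    signal = "I \U0001F60A George Washington died"

    # A UTF-16-based component counts the smiley as 2 units, so it records
    # the PERSON span "George Washington" as units 5 through 22.
    utf16_start, utf16_end = 5, 22

    # Applying those offsets with Python 3 code-point slicing misaligns the span.
    print(signal[utf16_start:utf16_end])  # "eorge Washington " -- shifted by one; the "G" is lost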
This problem has existed within MAT for a long time, because, in
its role as a "Swiss army kniife" for annotated documents, MAT is
intended to be able to interpret annotated documents produced
elsewhere. And if the representation of these external annotations
involves offsets, MAT has no way, at all, of knowing whether the
engine or tool which produced the annotations accounted for offset
discrepancies outside the BMP. So, for instance, consider a
document annotated using the spaCy tool, which is implemented in Python.
If the document representation that spaCy produces uses offsets,
the nature of the offsets will likely differ depending on whether
spaCy was run in Python 2 or in Python 3, and there's no way to
know from examining the document itself. One can guess; if, e.g.,
the annotation boundaries largely fall on whitespace boundaries
before a non-BMP character is encountered, and don't afterward,
and adjusting the indices to account for this error increases the
proportion of annotation boundaries which fall on whitespace
boundaries, it's likely that the tool in question doesn't account
for the BMP issue. But it's not possible to know for sure, and
there's no way to know whether multiple tools have already applied
inconsistent offset annotations.
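That guess can be turned into a rough check (a sketch of the heuristic described above, not a MAT feature; the function names and the decision rule are ours):

    # Rough heuristic: do annotation boundaries land on whitespace edges
    # more often if we assume the producing tool counted non-BMP
    # characters as 2 positions (UTF-16 code units) instead of 1?
    def on_whitespace_edge(text, offset):
        # An offset "falls on a whitespace boundary" if the character on
        # either side of it is whitespace (or the offset is at an edge).
        before = text[offset - 1] if 0 < offset <= len(text) else " "
        after = text[offset] if 0 <= offset < len(text) else " "
        return before.isspace() or after.isspace()

    def utf16_to_codepoint_offset(text, utf16_offset):
        # Map an offset counted in UTF-16 code units onto a code-point offset.
        units = 0
        for i, ch in enumerate(text):
            if units >= utf16_offset:
                return i
            units += 2 if ord(ch) >= 0x10000 else 1
        return len(text)

    def looks_like_utf16_offsets(text, offsets):
        # If re-interpreting the offsets as UTF-16 offsets puts more of them
        # on whitespace boundaries, the producing tool probably counted in
        # UTF-16 code units and ignored the BMP issue.
        raw = sum(on_whitespace_edge(text, o) for o in offsets)
        adjusted = sum(
            on_whitespace_edge(text, utf16_to_codepoint_offset(text, o))
            for o in offsets
        )
        return adjusted > raw

Even a check like this only yields a guess, for the reasons given above: it cannot distinguish a tool that ignored the BMP issue from a document whose offsets have already been adjusted, or adjusted more than once.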
MAT provides a way of partially mitigating the offset issue.
The HANDLE_NON_BMP configuration
variable, introduced in MAT 3.3, has four possible values: