Unicode issues

The short version

While MAT is Unicode-aware, not all Unicode characters are created equal. Those which aren't a character in a modern written language are handled inconsistently by language processing tools (including, possibly, MAT). By default, MAT will warn you about the presence of these characters. But you can change how MAT treats them.

The long version

Migrating MAT to Python 3 will exacerbate a problem with Unicode that has lurked for a very long time, but has seldom been encountered because of a set of coincidences which will no longer arise.

In general, MAT is Unicode-aware. It allows you to specify language script direction when you list the languages associated with your task (it's not smart enough to know without being told, but you can control it), and its native document format is encoded in UTF-8, which can represent the full range of Unicode characters. MAT has been successfully used to annotate documents in a wide range of languages. However, there's a very subtle problem with Unicode in MAT (and many, many other language processing tools) which can lead to mysterious misalignment of annotations. In order to understand this, you need to know something about how MAT thinks about annotations, and about how programming languages implement Unicode.

MAT uses character offsets to represent the location of annotations. Unlike in-line XML, which marks the location of an element by inserting start and end brackets into the character sequence (e.g., "<PERSON>George Washington</PERSON> died"), MAT notes, outside the document signal, the presence of a PERSON annotation from character 0 to character 17. In order to do this, the length of strings (and substrings) has to be consistently computed. For all of modern human language, this computation is reliable. However, for other characters (e.g., musical notation, heiroglyphics), it is not.

The reason for this has to do with Unicode. (If you don't know anything about Unicode, stop immediately and read Joel Spolsky's Unicode primer.) There are 2^20 possible Unicode characters (that's not exactly the right number or terminology, but it will serve for the current discussion). Of those, the first 2^16 (approximately) are in Unicode plane 0, otherwise known as the Basic Multilingual Plane (BMP). All of the characters in any living written language are in the BMP, and these characters aren't a problem for us; it's all the other characters that can cause a problem. This is because of how programming languages implement Unicode.

The mapping between Unicode characters and byte sequences is handled by character encodings. UTF-8 and ASCII, which you're likely familiar with, are character encodings. Some of these encodings (like ASCII) encode only a subset of Unicode; others (like UTF-8) encode the whole thing. Because Unicode has 2^20 possible characters, in order to encode the whole thing, you either need an encoding (e.g., UCS-4) which has enough bytes per character (4 bytes in the case of UCS-4), or you need a variable length encoding, which handles the larger character numbers using more bytes. UTF-8 is one of these variable-length encodings, but the one we care about right now is UTF-16.

UTF-16 uses, by default, 2 bytes for each character. This allows it to handle all of the BMP. The way it handles the rest of the characters is that it has a special range of 2-byte sequences (whose Unicode character numbers are specifically reserved for this purpose). These 2-byte sequences can only be interpreted as characters by including the following 2-byte sequence; it's called a "surrogate pair", and it allows UTF-16 to support all the characters outside the BMP, and the reason we care about this is that virtually all programming languages, by default, implement their strings as UTF-16 sequences. What this means is that, in these programming languages, the length of the string "I ❤ Unicode" (where the heart is Unicode code point U+2764, in the BMP) is 11, but the length of "I 😊 Unicode" (where the smiley face is Unicode code point U+1F60A, outside the BMP) is 12.

This becomes a potential problem when multiple programming languages need to extract substrings from the document. If all the programming languages involved treat strings consistently with each other, there's no problem, even if from the point of Unicode characters, the length of characters outside the BMP is "wrong". If the programming languages don't treat the strings in the same way, that's where the problems begin. Here's a sample of how programming languages handle strings:

Java implements strings as UTF-16 sequences. It has an alternate API which counts actual Unicode characters, but it's quite hard to use in comparison to the default API; for instance, extracting a substring using Unicode character offsets is quite clumsy. It's likely that most NLP tools implemented in Java don't use this alternate API, and among those tools is MAT's default tagging engine, jCarafe.
JavaScript implements strings as UTF-16 sequences. It also has an alternate API, but its introduction is relatively recent, and MAT's JavaScript-based annotation and display UI does not use it.
Python (the language in which the core of MAT is implemented) has a much more complicated history. Python 2, by default, implements strings as UTF-16 sequences; however, it could be compiled in a way that allots 4 bytes per character instead of 2. In this non-default implementation, the length of strings is the number of Unicode characters in the string. In Python 3, on the other hand, strings are implemented in a variable-length manner, in which the space required for the characters in a string is the maximum of the space required for any of the individual characters; so any string which includes a character outside the BMP will allocate 4 bytes per character, and the length of strings is the number of Unicode characters in the string.

So in Python 2, MAT's three core components (jCarafe, the core MAT library, the Web UI) all make the same Unicode "mistake", and although the length of the Unicode smiley face is 2, it's 2 consistently, and if MAT were to control the document from the beginning of its processing to the end, the strings would be consistently handled. In Python 3, on the other hand, this is not the case, because the core MAT library handles offsets differently than the jCarafe and Web UI implementations.

This problem has existed within MAT for a long time, because, in its role as a "Swiss army kniife" for annotated documents, MAT is intended to be able to interpret annotated documents produced elsewhere. And if the representation of these external annotations involves offsets, MAT has no way, at all, of knowing whether the engine or tool which produced the annotations accounted for offset discrepancies outside the BMP. So, for instance, consider a document annotated using the spaCy tool, which is implemented in Python. If the document representation that spaCy produces uses offsets, the nature of the offsets will likely differ depending on whether spaCy was run in Python 2 or in Python 3, and there's no way to know from examining the document itself. One can guess; if, e.g., the annotation boundaries largely fall on whitespace boundaries before a non-BMP character is encountered, and don't afterward, and adjusting the indices to account for this error increases the proportion of annotation boundaries which fall on whitespace boundaries, it's likely that the tool in question doesn't account for the BMP issue. But it's not possible to know for sure, and there's no way to know whether multiple tools have already applied inconsistent offset annotations.

Managing the offset issue

MAT provides a way of partially mitigating the offset issue.

The HANDLE_NON_BMP configuration variable, introduced in MAT 3.3, has four possible values:

"warn": when MAT reads a document, and it finds characters outside the BMP, it will warn the user. This is the default behavior.
"fail": when MAT reads a document, and it finds characters outside the BMP, it will raise an error.
"ignore": MAT ignores any issues with characters outside the BMP. This is the pre-3.3 behavior.
"scrub_or_warn": when MAT reads a raw document, and it finds characters outside the BMP, it will replace them in the document signal with Unicode code point U+25A1 (WHITE SQUARE), which is a standard placeholder character within the BMP. Otherwise equivalent to "warn".

The value of this configuration variable can be overridden in the usual ways, or through the use of the --handle_non_bmp command-line option of MATEngine, MATTransducer, MATModelBuilder, MATReport, and MATScore, or via the handleNonBMP UI setting.

The scrubbing behavior here is carefully limited. Because it's impossible to know the status of existing annotation offsets, it's not possible to scrub documents that have already been annotated. So MAT only offers the scrubbing behavior on raw documents, where the document signal is first ingested. In the future, if MAT better addresses this issue, there may be more (and better) conversion options.