Implementation¶
The Tokeniser¶
Tokenization is performed by the Tokeniser class. The most important
method in this class is by far the next method. This method consumes
characters from the feeder and returns a token (if the tokenization succeeds).
Tokenization Rules¶
Tokenization rules can are defined by declaring methods (in the Tokeniser
class) whose names are preceded by t_, such as in the following example:
def t_SomeRule(self, match):
# Some logic goes here...
pass
A tokenization rule is supposed to take a regular expression match (the
match parameter of type re.Match) and convert it to an appropriate
token, which is then returned by the method. The rule is also responsible for
updating the internal state of the tokeniser, such as incrementing the pos
counter.
A rule is always expected to receive sane input. In other words, deciding which
rule to call is a responsibility of the caller. Rules are are also
automatically called from inside of next.
Messaging Functionality¶
Warnings and errors encountered during scanning and tokenization are collected
in a message queue and stored in the feeders using the message and
syntax_message methods of LineFeeder. The message queue is therefore a
property of the feeder. The Tokeniser class also has a method to append
messages to the message queue of it’s feeder, the syntax_message method.
The messages are stored using Mathics’ internal format, but this is going to be revised in the next release (in fact, we plan to replace messages by errors entirely).
Character Conversions¶
The mathics_scanner.characters module consists mostly of translation tables
between Wolfram’s internal representation and Unicode/ASCII. For
maintainability, it was decided to store this data in a human-readable YAML
table (in data/named-characters.yml).
The YAML table mainly contains information about how to convert a
named character to Unicode and back. If a given character has a direct Unicode
equivalent (a Unicode character whose description is similar as the named
character’s), this is specified by the unicode-equivalent field in the YAML
table. Note that multiple named characters may share a common
unicode-equivalent field. Also, if a named character has a Unicode
equivalent, it’s unicode-equivalent field need not to consist of a single
Unicode code-point. For example, the Unicode equivalent of \[FormalAlpha]
is U+03B1 U+0323 (or GREEK SMALL LETTER ALPHA + COMBINING DOT BELOW).
If a named character has a unicode-equivalent field whose description fits
the precise description of the character then it’s has-unicode-inverse
field in the YAML table is set to true.
The conversion routines replace_wl_with_plain_text and
replace_unicode_with_wl use this information to convert between Wolfram’s
internal format and standard Unicode, but it should be noted that the
conversion scheme is more complex than a simple lookup in the YAML table.
The Conversion Scheme¶
The replace_wl_with_plain_text functions converts text from Wolfram’s
internal representation to standard Unicode or ASCII. If set to True, the
use_unicode argument indicates to replace_wl_with_plain_text that the
input should be converted to standard Unicode. If set to False,
use_unicode indicates to replace_wl_with_plain_text that it should only
output standard ASCII.
The algorithm for converting from Wolfram’s internal representation to standard Unicode is the following:
If a character has a direct Unicode equivalent then the character is replaced by it’s Unicode equivalent.
If a character doesn’t have a Unicode equivalent then the character is replaced by it’s fully qualified name. For example, the
\[AliasIndicator]character (orU+F768in Wolfram’s internal representation) is replaced by the Python string"\\[AliasIndicator]".
The algorithm for converting from Wolfram’s internal representation to standard ASCII is the following:
If a character has a direct Unicode equivalent and all of the characters of it’s Unicode equivalent are valid ASCII characters then the character is replaced by it’s Unicode equivalent.
If a character doesn’t have a Unicode equivalent or any of the characters of it’s Unicode equivalent isn’t a valid character then the character is replaced by it’s fully qualified name.
The replace_unicode_with_wl function converts text from standard Unicode to
Wolfram’s internal representation. The algorithm for converting from standard
Unicode to Wolfram’s internal representation is the following:
If a Unicode character sequence happens to match the
unicode-equivalentof a Wolfram Language named character whosehas-unicode-inversefield is set totrue, then the Unicode character is replaced by the Wolfram’s internal representation of such named character. Note that the YAML table is maintained in such a way that there is always at most one character that fits such description.Otherwise the character is left unchanged. Note that fully qualified names (such as the Python string
"\\[Alpha]"or the Python string"Alpha") are not replaced at all.
Optimizations¶
Because of the large size of the YAML table and the relative complexity of the conversion scheme, it was decided to store precompiled conversion tables in a file and read them from disk at runtime (when the module is imported). Our tests showed that storing the tables as JSON and using ujson to read them is the most efficient way to access them. However, this is merely an implementation detail and consumers of this library should not rely on this assumption.
The conversion tables are stored in the data/characters.json file, along
side other complementary information used internally by the library.
data/characters.json holds three conversion tables:
The
wl-to-unicodetable, which stores the precompiled results of the Wolfram-to-Unicode conversion algorithm.wl-to-unicodeis used for lookup whenreplace_wl_with_plain_textis called with theuse_unicodeargument set toTrue.The
wl-to-asciitable, which stores the precompiled results of the Wolfram-to-ASCII conversion algorithm.wl-to-asciiis used for lookup whenreplace_wl_with_plain_textis called with theuse_unicodeargument set toFalse.The
unicode-to-wltable, which stores the precompiled results of the Unicode-to-Wolfram conversion algorithm.unicode-to-wlis used for lookup whenreplace_unicode_with_wlis called.
The precompiled translation tables, as well as the rest of data stored in
data/characters.json, is generated from the YAML table with the
mathics_scanner.generate.build_tables.compile_tables function.
Note that multiple entries in the YAML table are redundant in the following
sense: when a character has a Unicode equivalent equivalent but the Unicode
equivalent is the same as it’s Wolfram’s internal representation (i.e. the
wl-unicode field is the same as the unicode-equivalent field in the
YAML table) then it is considered redundant for us, since no conversion is
needed.
As an optimization, we explicitly remove any redundant characters from all precompiled conversion tables. Such optimization makes the tables smaller and easier to load. This implies that not all named characters that have a Unicode equivalent are included in the precompiled translation tables (the ones that are not included are the ones where no conversion is needed).