==============
Implementation
==============

The Tokeniser
=============

Tokenization is performed by the ``Tokeniser`` class. The most important method
in this class is the ``next`` method, which consumes characters from the feeder
and returns a token (if the tokenization succeeds).

Tokenization Rules
------------------

Tokenization rules are defined by declaring methods (in the ``Tokeniser``
class) whose names are prefixed with ``t_``, as in the following example:

::

    def t_SomeRule(self, match):
        # Some logic goes here...
        pass

A tokenization rule takes a regular expression match (the ``match`` parameter,
of type ``re.Match``) and converts it to an appropriate token, which is then
returned by the method. The rule is also responsible for updating the internal
state of the tokeniser, such as incrementing the ``pos`` counter.

A rule is always expected to receive sane input: deciding which rule to call is
the responsibility of the caller. Rules are also called automatically from
inside ``next``.

Messaging Functionality
-----------------------

Warnings and errors encountered during scanning and tokenization are collected
in a message queue and stored in the feeders using the ``message`` and
``syntax_message`` methods of ``LineFeeder``. The message queue is therefore a
property of the feeder. The ``Tokeniser`` class also has a method to append
messages to the message queue of its feeder, the ``syntax_message`` method.

The messages are stored using Mathics' internal format, but this is going to be
revised in the next release (in fact, we plan to replace messages by errors
entirely).

Character Conversions
=====================

The ``mathics_scanner.characters`` module consists mostly of translation tables
between Wolfram's internal representation and Unicode/ASCII. For
maintainability, this data is stored in a human-readable YAML table (in
``data/named-characters.yml``).

The YAML table mainly contains information about how to convert a named
character to Unicode and back. If a given character has a direct Unicode
equivalent (a Unicode character whose description is similar to the named
character's), this is specified by the ``unicode-equivalent`` field in the YAML
table. Note that multiple named characters may share a common
``unicode-equivalent`` field. Also, if a named character has a Unicode
equivalent, its ``unicode-equivalent`` field need not consist of a single
Unicode code point. For example, the Unicode equivalent of ``\[FormalAlpha]``
is ``U+03B1 U+0323`` (or ``GREEK SMALL LETTER ALPHA + COMBINING DOT BELOW``).

If a named character has a ``unicode-equivalent`` field whose description fits
the precise description of the character, then its ``has-unicode-inverse``
field in the YAML table is set to ``true``.

The conversion routines ``replace_wl_with_plain_text`` and
``replace_unicode_with_wl`` use this information to convert between Wolfram's
internal format and standard Unicode, but it should be noted that the
conversion scheme is more complex than a simple lookup in the YAML table.

The Conversion Scheme
---------------------

The ``replace_wl_with_plain_text`` function converts text from Wolfram's
internal representation to standard Unicode *or* ASCII. When the
``use_unicode`` argument is ``True``, the input is converted to standard
Unicode. When it is ``False``, ``replace_wl_with_plain_text`` outputs only
standard ASCII.

The algorithm for converting from Wolfram's internal representation to standard
Unicode is the following:

- If a character has a direct Unicode equivalent then the character is replaced
  by its Unicode equivalent.
- If a character doesn't have a Unicode equivalent then the character is
  replaced by its fully qualified name. For example, the ``\[AliasIndicator]``
  character (or ``U+F768`` in Wolfram's internal representation) is replaced by
  the Python string ``"\\[AliasIndicator]"``.

The algorithm for converting from Wolfram's internal representation to standard
ASCII is the following:

- If a character has a direct Unicode equivalent and all of the characters of
  its Unicode equivalent are valid ASCII characters then the character is
  replaced by its Unicode equivalent.
- If a character doesn't have a Unicode equivalent, or any of the characters of
  its Unicode equivalent isn't a valid ASCII character, then the character is
  replaced by its fully qualified name.

The ``replace_unicode_with_wl`` function converts text from standard Unicode to
Wolfram's internal representation. The algorithm for converting from standard
Unicode to Wolfram's internal representation is the following:

- If a Unicode character sequence matches the ``unicode-equivalent`` of a
  Wolfram Language named character whose ``has-unicode-inverse`` field is set
  to ``true``, then the sequence is replaced by Wolfram's internal
  representation of that named character. Note that the YAML table is
  maintained in such a way that there is always *at most* one character that
  fits this description.
- Otherwise the character is left unchanged. Note that fully qualified names
  (such as the Python string ``"\\[Alpha]"`` or the Python string ``"Alpha"``)
  are *not* replaced at all.

Optimizations
-------------

Because of the large size of the YAML table and the relative complexity of the
conversion scheme, precompiled conversion tables are stored in a file and read
from disk at runtime (when the module is imported). Our tests showed that
storing the tables as JSON and using ``ujson`` to read them is the most
efficient way to access them. However, this is merely an implementation detail
and consumers of this library should not rely on this assumption.
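
As an illustration of what the precompiled tables have to encode, the rules of
the conversion scheme can be sketched directly in Python. This is a minimal
sketch, not the library's implementation: the ``SAMPLE`` dictionary, the
``sketch_*`` function names, and every entry other than ``AliasIndicator``
(including the private-use code points) are hypothetical, and leaving unknown
characters untouched is an assumption.

```python
import re

# Hypothetical sample data, NOT the real YAML table. Field names follow the
# ones used in this document; only AliasIndicator (U+F768) comes from the text.
SAMPLE = {
    "AliasIndicator": {"wl-unicode": "\uf768"},  # no Unicode equivalent
    "ExampleAlpha": {  # invented entry with a non-ASCII equivalent
        "wl-unicode": "\ue001",
        "unicode-equivalent": "\u03b1",
        "has-unicode-inverse": True,
    },
    "ExampleBar": {  # invented entry with an ASCII equivalent
        "wl-unicode": "\ue002",
        "unicode-equivalent": "|",
        "has-unicode-inverse": False,
    },
}

# Index the sample by Wolfram's internal code point for quick lookup.
BY_WL = {entry["wl-unicode"]: (name, entry) for name, entry in SAMPLE.items()}


def sketch_wl_char_to_plain(ch, use_unicode=True):
    """Convert one internal character following the rules listed above."""
    if ch not in BY_WL:
        return ch  # assumption: characters outside the table are left alone
    name, entry = BY_WL[ch]
    equiv = entry.get("unicode-equivalent")
    # Unicode mode accepts any equivalent; ASCII mode only an all-ASCII one.
    if equiv is not None and (use_unicode or equiv.isascii()):
        return equiv
    return f"\\[{name}]"  # otherwise fall back to the fully qualified name


def sketch_wl_to_plain(text, use_unicode=True):
    """Apply the per-character rule to every character of the text."""
    return "".join(sketch_wl_char_to_plain(ch, use_unicode) for ch in text)


def sketch_unicode_to_wl(text):
    """Replace only sequences whose entry has has-unicode-inverse set."""
    inverse = {
        entry["unicode-equivalent"]: entry["wl-unicode"]
        for entry in SAMPLE.values()
        if entry.get("has-unicode-inverse")
    }
    if not inverse:
        return text
    # Try longer sequences first so multi-code-point equivalents win.
    keys = sorted(inverse, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: inverse[match.group(0)], text)
```

With this sample data, ``sketch_wl_to_plain("\ue001\ue002", use_unicode=False)``
spells out the first character as ``\[ExampleAlpha]`` and keeps the ASCII
equivalent ``|`` for the second, while ``sketch_unicode_to_wl`` leaves ``|``
untouched because its entry has no Unicode inverse.
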

The conversion tables are stored in the ``data/characters.json`` file,
alongside other complementary information used internally by the library.
``data/characters.json`` holds three conversion tables:

- The ``wl-to-unicode`` table, which stores the precompiled results of the
  Wolfram-to-Unicode conversion algorithm. ``wl-to-unicode`` is used for lookup
  when ``replace_wl_with_plain_text`` is called with the ``use_unicode``
  argument set to ``True``.
- The ``wl-to-ascii`` table, which stores the precompiled results of the
  Wolfram-to-ASCII conversion algorithm. ``wl-to-ascii`` is used for lookup
  when ``replace_wl_with_plain_text`` is called with the ``use_unicode``
  argument set to ``False``.
- The ``unicode-to-wl`` table, which stores the precompiled results of the
  Unicode-to-Wolfram conversion algorithm. ``unicode-to-wl`` is used for lookup
  when ``replace_unicode_with_wl`` is called.

The precompiled translation tables, as well as the rest of the data stored in
``data/characters.json``, are generated from the YAML table with the
``mathics_scanner.generate.build_tables.compile_tables`` function.

Note that multiple entries in the YAML table are redundant in the following
sense: when a character has a Unicode equivalent but the Unicode equivalent is
the same as its Wolfram internal representation (i.e. the ``wl-unicode`` field
is the same as the ``unicode-equivalent`` field in the YAML table), no
conversion is needed and the entry is considered redundant. As an optimization,
we explicitly remove any redundant characters from *all* precompiled conversion
tables, which makes the tables smaller and faster to load. This implies that
not all named characters that have a Unicode equivalent are included in the
precompiled translation tables: the ones left out are exactly those for which
no conversion is needed.
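
The pruning of redundant entries can be sketched as follows. This is a minimal
sketch under stated assumptions, not the actual ``compile_tables`` code: the
``compile_sketch`` name and the sample entries other than ``AliasIndicator``
are hypothetical, every entry is assumed to carry a ``wl-unicode`` field, and
the standard-library ``json`` module stands in for ``ujson``.

```python
import json

# Hypothetical entries, NOT copied from data/named-characters.yml.
ENTRIES = {
    "AliasIndicator": {"wl-unicode": "\uf768"},  # from the text, no equivalent
    "ExampleAlpha": {  # invented: needs a real conversion
        "wl-unicode": "\ue001",
        "unicode-equivalent": "\u03b1",
        "has-unicode-inverse": True,
    },
    "ExampleDegree": {  # invented: redundant, both fields are identical
        "wl-unicode": "\u00b0",
        "unicode-equivalent": "\u00b0",
        "has-unicode-inverse": True,
    },
}


def compile_sketch(entries):
    """Build the three tables, skipping entries that need no conversion."""
    tables = {"wl-to-unicode": {}, "wl-to-ascii": {}, "unicode-to-wl": {}}
    for name, entry in entries.items():
        wl = entry["wl-unicode"]
        equiv = entry.get("unicode-equivalent")
        if equiv == wl:
            continue  # redundant entry: dropped from *all* tables
        qualified = f"\\[{name}]"
        tables["wl-to-unicode"][wl] = qualified if equiv is None else equiv
        ascii_ok = equiv is not None and equiv.isascii()
        tables["wl-to-ascii"][wl] = equiv if ascii_ok else qualified
        if equiv is not None and entry.get("has-unicode-inverse"):
            tables["unicode-to-wl"][equiv] = wl
    return tables


TABLES = compile_sketch(ENTRIES)

# Round-trip through JSON, as the precompiled tables would be on disk.
RESTORED = json.loads(json.dumps(TABLES))
```

A lookup that misses the table (as it would for the pruned ``ExampleDegree``
character) can simply return the character unchanged, for instance with
``TABLES["wl-to-unicode"].get(ch, ch)``, which is why dropping redundant
entries does not change the result of a conversion.
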