Friday, October 18, 2013

Termcat: Details On the Current Implementation and the Potential for Optimization

In my previous blog post I gave a general overview of the rewriting system behind Termcat. Below I explain how this system is implemented in Clojure. I also mention some potential optimizations.

This blog post assumes basic familiarity with Clojure and Termcat.

The General Case

Recall that rewriting rules transform trees of tokens into trees of tokens. Equivalently, taking a term to be either a token or a sequence of terms, they transform sequences of terms into sequences of terms.

In Clojure we represent rewriting rules by functions that take the following arguments:
  • An object state
  • A sequence result
  • A term v
Rules return pairs of the following elements:
  • An object new-state
  • An updated sequence new-result
Rules have the following meta-data (that is, information you need to know in order to apply the rule):
  • An integer padding-count
  • An object init-state
When we run (rewrite terms rule), the following happens (a sketch of the loop is given after this list):
  • Every term t in terms that is a sequence is replaced by the result of (rewrite t rule). (Recursion ends when terms is a sequence of tokens.)
  • Next, we call (rule init-state (repeat padding-count nil) (first terms)).
  • The function rule is then called once for each of the remaining terms in terms (both tokens and sequences), where the state and result passed to each call are those returned by the previous call.
  • Finally, rule is called padding-count more times with nil passed as the last argument.
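To make this loop concrete, here is a rough Clojure sketch of rewrite. It assumes that padding-count and init-state are attached to the rule as Clojure metadata and that a rule returns a [new-state new-result] pair as described above; the actual Termcat implementation may differ in its details.

(defn rewrite [terms rule]
  (let [{:keys [padding-count init-state]} (meta rule)
        ;; Step 1: recurse into nested sequences first.
        terms  (map #(if (sequential? %) (rewrite % rule) %) terms)
        ;; Steps 2-4: feed every term, plus padding-count trailing nils,
        ;; through the rule, threading state and result along.
        inputs (concat terms (repeat padding-count nil))]
    (second
     (reduce (fn [[state result] term] (rule state result term))
             [init-state (vec (repeat padding-count nil))]
             inputs))))

Depending on how a rule treats the nil padding, some of it may survive at the front of the final result; a real implementation would still have to strip it.
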
Interestingly, all but two rules fit the following simple pattern:
  • They do pattern matching on the term or token type (see my previous post) of v and of the last padding-count elements of result
  • The new result that they return is the old result, minus (possibly) its last padding-count elements, plus a number of new terms
The rule that converts the token sequence into an abstract syntax tree is the most notable rule that does not adhere to this pattern. It needs to be able to pop any number of elements from the accumulated result.

The other rule that violates the above pattern is the rule for evaluating functions. This is because this rule may need to pop any number of function arguments from the accumulated result. However, this is merely a matter of convenience. By using currying we could make this case conform to the aforementioned pattern.
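
As a loose, hypothetical illustration of the currying idea (this is not Termcat's actual evaluation rule), consider a rule that only ever pairs the last element of result with the single term v:

(defn apply-one-argument [state result v]
  (let [f (peek result)]                      ; result is assumed to be a vector
    (if (and (fn? f) (not (fn? v)))
      ;; Replace the curried function by its partial application;
      ;; repeated calls consume one argument at a time.
      [state (conj (pop result) (f v))]
      ;; No function/argument pair at hand: keep the state, pass v through.
      [state (conj result v)])))

With curried functions, such as (fn [a] (fn [b] (+ a b))), every call reads a fixed number of elements, so function evaluation would fit the simple pattern.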

A Macro For the Simple Pattern

Termcat uses a macro defrule to deal with the simple pattern. Using this macro a rule might look like this:

(defrule remove-superfluous-whitespace
  [state t1 t2]
  tt
  [_ (:or nil
          :emptyline
          [:block :indent]
          [:block :bullet]) :whitespace] [nil t1]
  [_ :whitespace (:or nil
                      :emptyline
                      [:block :indent]
                      [:block :bullet])] [nil t2])

The first argument assigns names to the state and the tokens that the rule wants to read. This implicitly also specifies the number of tokens it wants to read and the desired padding-count (the number of tokens minus one). Here, it reads one element t1 from result. The second element t2 corresponds to v.

The second argument, tt, specifies that we want to match on the term-type of t1 and t2. The remaining arguments do pattern matching (using core.match) on state, (tt t1), and (tt t2).

A rule defined with defrule is expected to return a vector of the new state and the tokens that should replace the tokens it read. Alternatively, nil can be returned to indicate that nothing should change. This is also what happens if none of the patterns match the input.
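
To tie this back to the general contract, the rule above corresponds roughly to a hand-written rule along the following lines. This is an illustrative sketch only: tt is Termcat's term-type function, (tt nil) is assumed to be nil for padding elements, result is assumed to be a vector, and the expansion produced by the real defrule macro will differ.

(require '[clojure.core.match :refer [match]])

(defn remove-superfluous-whitespace* [state result v]
  (let [t1 (peek result)   ; the one term read back from result
        t2 v]
    (match [state (tt t1) (tt t2)]
      [_ (:or nil :emptyline [:block :indent] [:block :bullet]) :whitespace]
      [nil (conj (pop result) t1)]   ; keep t1, drop the whitespace term t2
      [_ :whitespace (:or nil :emptyline [:block :indent] [:block :bullet])]
      [nil (conj (pop result) t2)]   ; drop the whitespace term t1, keep t2
      :else
      [state (conj result v)])))     ; no match: keep the state, pass v through

Presumably the defrule macro generates this kind of boilerplate for us: it reads the right number of terms, performs the match, and splices the replacement terms back into result.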

Finite State Machines

The rules we discussed are simple enough that their matching part can be transformed into regular expressions or finite state machines. The difficulty lies in efficiently handling the substitutions that the rules make.

In essence, transforming a Termcat document into an HTML document is a matter of applying a bunch of rules sequentially. That is, we're faced with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite tokenizer-rule1)
    (rewrite tokenizer-rule2)
    ...
    (rewrite generate-abstract-syntax-tree)
    (rewrite post-ast-rule1)
    (rewrite post-ast-rule2)
    ...
    (write-html-document))

It would be interesting to investigate whether the different rewriting rules could be merged automatically, using macros, into a single finite state machine. Ideally we would end up with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite merged-tokenizer-rules)
    (rewrite generate-abstract-syntax-tree)
    (rewrite merged-post-ast-rules)
    (write-html-document))

Parallelism and Fast Recompilation

All of the rewriting rules (including the non-simple ones) are deterministic. Their output depends only on the following factors: (i) the state, (ii) the tokens that they read from result, and (iii) v.

This is incredibly exciting! Here's why: Parallelism and Fast Recompilation.

Suppose we want to rewrite a dozen tokens using a rule rule1. The first token has the payload (or content) '1', the second token has the payload '2', and so on. What we could do is apply rule1 in parallel to (i) the complete sequence of 12 tokens and (ii) the sequence of tokens 7-12. These two processes are depicted below.
The coloring indicates which calls to rule1 have identical input (meaning the tokens they read from result, their state, and v are identical).

We can see that for token 7 the input to rule1 differs between the two processes. This could be because in the first process rule1 rewrote token 5 or 6. It could also be because the state differs between the two function calls. Here we should discard the second process's output for token 7. When the first process arrives at the eighth token, however, we see that its arguments for rule1 are identical to the arguments that the second process used. This is great: it means we can terminate the first process and append the remaining result of the second process to the result of the first process (discarding those tokens that the second process emitted when rewriting token 7 and that rule1 did not read back in upon rewriting token 8).
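
Here is a hypothetical sketch of the convergence check (the helper names are illustrative, not Termcat functions): each process records, per input token, exactly the data that rule1 looks at, and the full run can stop as soon as its record for a token equals the speculative run's record for that same token.

(defn call-signature [padding-count state result v]
  ;; A rule only sees the state, the last padding-count elements of the
  ;; accumulated result, and the current term v.
  [state (vec (take-last padding-count result)) v])

(defn converged? [padding-count a b]
  ;; a and b are [state result v] triples from the two processes,
  ;; taken at the same input token.
  (= (apply call-signature padding-count a)
     (apply call-signature padding-count b)))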

A similar optimization should be possible when the source code of a Termcat document is modified. Think fast (partial) recompilation.

Three caveats:

  1. Coordinating this sort of parallelism is inherently tricky. For small inputs it would definitely not be worth it. And what's 'small' in this context?
  2. Moreover, it's not clear if the overhead of comparing the arguments for rule1 would, in practice, ever be worth the time saved by not having to do redundant rewriting.
  3. This optimization is not applicable to the rule for generating abstract syntax trees since this rule (i) tends to keep a lot of state around and (ii) may need to read an arbitrary number of elements from result.

Note that much of the last two sections is speculation. I might very well dig deeper into these things at some point because I think that would be a lot of fun. Nevertheless, this will probably have to wait until after Hacker School since I want to focus on more basic features first.

Meanwhile, I'm approaching a point where I want to start demoing Termcat to people and get feedback. In my next blog post I will give an overview of the features currently implemented.