Monday, October 28, 2013

Termcat: Random Remarks

Current Feature-Set

Here's what Termcat can currently do.

Words can be emphasized by writing *italics* for italics. Sentences or sentence fragments can be emphasized *(like this)*. The outermost parentheses are automatically removed: like this. There's also syntax for **bold** and _underlined_ text.

Blockquotes are created by indenting text using two spaces.

Bullet lists are created like this:

- This is notation
- for bullet
  list.

As demonstrated above, bullet items can be continued on a new line by indenting those subsequent lines with two spaces. Bullet lists can also be nested and bullet items may contain blockquotes or multiple paragraphs.

Numbered lists are created like this:

:numbered-list
- item 1
- item 2

Termcat actually interprets this as the (curried) function call :numbered-list(item 1)(item 2). In fact, all syntactic sugar is translated into such function calls. For example, *this* is actually rewritten as :emph(this).

Fun fact: a sentence fragment can also be italicized by writing this:

:emph
  this is a function argument, not a blockquote

or

:emph
- this is not a real bullet item

The reason for this is that bullet items, indented blocks, and text in parentheses or other brackets are all represented internally as the same thing. Hence writing :foo(bar)(baz), :foo[bar]{baz}, or

:foo
- bar
- baz

all amount to the same thing.

Links are created using the syntax [Text](http://example.com).

Images are created like this: ![Alt text](example.jpg)

Finally, there's support for creating section titles:

# This is a section title (h1)
## This is a subsection title (h2)

Planned Features

I have implemented rudimentary support for mathematical equations. There are still a lot of details to work out, though, and I hope to finish this work by the end of this week.

I have also started thinking about a Termcat-JavaScript interface. Specifically, I'd love it if existing JavaScript components could be reused in Termcat. I'm not sure what the interface for that should look like, though.

Zach, one of the Hacker School facilitators, pointed me to x-tags (which is inspired by web components) and this might be one way out. If I dictated that JavaScript code should be called via x-tags then I wouldn't have to add much more to Termcat than the ability to output arbitrary HTML tags. Nevertheless, I'm not sure x-tags is the answer. For instance, I think that x-tags / web components syntax is not well-suited for describing graphs, whereas in LaTeX graphs can be somewhat elegantly described using PGF/TikZ.

Related Work

In my previous blog post I mentioned that there seemed to be a link between my rewriting rules and finite state automata. To that I want to add that there are also clear links to concatenative programming, in that the input and output of every rewriting rule consist solely of pop and push operations on a stack. (Kudos to Vlad for that link.)

David, another Hacker School facilitator, pointed me to an interesting research and teaching program that goes by the name 'nanopass compiler'. The idea behind this program is to construct compilers out of many passes that each perform one simple kind of transformation. Sounds like a rewriting system to me! The big difference from the Termcat rewriting system seems to be that in a nanopass compiler you construct a formal grammar for every intermediate language, whereas I don't have a formal grammar at any point.

Friday, October 18, 2013

Termcat: Details On the Current Implementation and the Potential for Optimization

In my previous blog post I gave a general overview of the rewriting system behind Termcat. Below I explain how this system is implemented in Clojure. I also mention some potential optimizations.

This blog post assumes basic familiarity with Clojure and Termcat.

The General Case

Recall that rewriting rules transform trees of tokens into trees of tokens. Equivalently, if we define a term as either a token or a sequence of terms, rewriting rules transform sequences of terms into sequences of terms.

In Clojure we represent rewriting rules by functions that take the following arguments:
  • An object state
  • A sequence result
  • A term v
Rules return pairs of the following elements:
  • An object new-state
  • An updated sequence new-result
Rules have the following meta-data (meaning, things you need to know to apply the rule):
  • An integer padding-count
  • An object init-state
When we run (rewrite terms rule) this happens:
  • Every term t in terms that is a sequence is replaced by the result of (rewrite t rule). (Recursion ends when terms is a sequence of tokens.)
  • Next, we call (rule init-state (repeat padding-count nil) (first terms))
  • The function rule is then called once for each of the other terms in terms (both tokens and sequences), where the state and result passed to each call are what the previous call returned
  • Finally, rule is called padding-count more times, with nil passed for v
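To make this concrete, here is a minimal sketch of what such a driver could look like. This is my description distilled into code, not the actual Termcat source; in particular, I assume here that tokens are [type payload] vectors, that nested term sequences are lists, and that a real implementation would strip leftover padding from the result.

(defn rewrite [terms rule]
  (let [{:keys [padding-count init-state]} (meta rule)
        ;; first, recursively rewrite every term that is itself a sequence
        terms (map #(if (vector? %) % (rewrite % rule)) terms)]
    (second
      (reduce (fn [[state result] term]
                (rule state result term))
              ;; the result starts out as padding-count nils...
              [init-state (vec (repeat padding-count nil))]
              ;; ...and the rule is called padding-count extra times
              ;; with nil for v at the end
              (concat terms (repeat padding-count nil))))))

A rule would carry its meta-data in the ordinary Clojure way, e.g. (with-meta rule-fn {:padding-count 1, :init-state nil}).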
Interestingly, all but two rules fit the following simple pattern:
  • They do pattern matching on the term or token type (see my previous post) of v and of the last padding-count elements of result
  • The new result that they return is the old result, minus possibly its padding-count last elements, plus a number of new terms
The rule that converts the token sequence into an abstract syntax tree is the most notable rule that does not adhere to this pattern. It needs to be able to pop any number of elements from the accumulated result.

The other rule that violates the above pattern is the rule for evaluating functions. This is because this rule may need to pop any number of function arguments from the accumulated result. However, this is merely a matter of convenience. By using currying we could make this case conform to the aforementioned pattern.

A Macro For the Simple Pattern

Termcat uses a macro defrule to deal with the simple pattern. Using this macro a rule might look like this:

(defrule remove-superfluous-whitespace
  [state t1 t2]
  tt
  [_ (:or nil
          :emptyline
          [:block :indent]
          [:block :bullet]) :whitespace] [nil t1]
  [_ :whitespace (:or nil
                      :emptyline
                      [:block :indent]
                      [:block :bullet])] [nil t2])

The first argument assigns names to the state and to the tokens that the rule wants to read. It thereby also implicitly specifies the number of tokens to read and the desired padding-count (the number of tokens, minus 1). Here, the rule reads one element t1 from result; the second element t2 corresponds to v.

The second argument, tt, specifies that we want to match on the term-type of t1 and t2. The remaining arguments do pattern matching (using core.match) on state, (tt t1), and (tt t2).

A rule defined with defrule is expected to return a vector of the new state and the tokens that should replace the tokens it read. Alternatively, nil can be returned to indicate that nothing should change. This is also what happens if none of the patterns match the input.
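For instance, assuming the driver sketched in the previous section, applying this rule could look as follows. The sample tokens and the result are guesses for illustration, not verified output:

(rewrite (list [:emptyline "\n\n"] [:whitespace " "] [:default "Hi"])
         remove-superfluous-whitespace)
;; presumably yields ([:emptyline "\n\n"] [:default "Hi"])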

Finite State Machines

The rules we discussed are simple enough that their matching part can be translated into regular expressions or finite state machines. The difficulty lies in efficiently handling the substitutions that the rules make.

In essence, transforming a Termcat document into an HTML document is a matter of applying a bunch of rules sequentially. That is, we're faced with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite tokenizer-rule1)
    (rewrite tokenizer-rule2)
    ...
    (rewrite generate-abstract-syntax-tree)
    (rewrite post-ast-rule1)
    (rewrite post-ast-rule2)
    ...
    (write-html-document))

It would be interesting to investigate whether or not the different rewriting rules could be automatically—using macros—merged into a single finite state machine. Ideally we would end up with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite merged-tokenizer-rules)
    (rewrite generate-abstract-syntax-tree)
    (rewrite merged-post-ast-rules)
    (write-html-document))

Parallelism and Fast Recompilation

All of the rewriting rules (including the non-simple ones) are deterministic. Their output depends only on the following factors: (i) the state, (ii) the tokens that they read from result, and (iii) v.

This is incredibly exciting! Here's why: Parallelism and Fast Recompilation.

Suppose we want to rewrite a dozen tokens using a rule rule1. The first token has the payload (or content) '1', the second token has the payload '2', and so on. What we could do is apply rule1 in parallel to (i) the complete sequence of 12 tokens and (ii) the sequence of tokens 7 through 12. These two processes are depicted below.
[Figure: the two processes side by side; the coloring indicates which calls to rule1 have identical input (meaning the tokens they read from result, their state, and v are identical).]

We can see that for token 7 the input to rule1 differs between the two processes. This could be because in the first process rule1 rewrote token 5 or 6. It could also be because the state is different for the two function calls. Here we should discard the output of the second process. When the first process arrives at the eighth token, however, we see that its arguments for rule1 are identical to the arguments that the second process used. This is great: it means we can terminate the first process and append the result of the second process to the result of the first process (discarding those tokens that were emitted after the second process rewrote token 7 and that were not read in by rule1 upon rewriting token 8).

Similar optimization should be possible when the source code of a Termcat document is modified. Think fast (partial) recompilation.

Three caveats:

  1. Coordinating this sort of parallelism is inherently tricky. For small inputs it would definitely not be worth it. And what's 'small' in this context?
  2. Moreover, it's not clear if the overhead of comparing the arguments for rule1 would, in practice, ever be worth the time saved by not having to do redundant rewriting.
  3. This optimization is not applicable to the rule for generating abstract syntax trees since this rule (i) tends to keep a lot of state around and (ii) may need to read an arbitrary number of elements from result.
Note that much of the last two sections is speculation. I might very well dig deeper into these things at some point because I think that would be a lot of fun. Nevertheless, this will probably have to wait until after Hacker School since I want to focus on more basic features first.

Meanwhile, I'm approaching a point where I want to start demoing Termcat to people and get feedback. In my next blog post I will give an overview of the features currently implemented.

Saturday, October 12, 2013

The Termcat Rewriting System and Its Different Stages

In my previous blog post I described some of the syntax of Termcat, a markup language that I have been working on while attending Hacker School. In this post I'd like to outline how I'm implementing it.

Termcat documents can be thought of as a sequence of Termcat expressions. Compiling Termcat documents into HTML involves applying a series of transformations to these expressions until every one of them is reduced to a piece of HTML code. The concatenation of this HTML code, wrapped in some leading and trailing code, constitutes the output document.

When I say 'applying transformations' I'm being very literal. Everything after reading the file and before mixing in the leading and trailing code is implemented as a rewriting system. There are six stages to this process. The purpose of this blog post is to explain them one by one.

Pre-tokenization (1/6)

The Termcat document is initially loaded as a sequence of Unicode characters. Next, every one of these characters is mapped onto a 'token'. Think of a token as a pair consisting of (i) a token type and (ii) a payload. The payload is the character. The token type is, at this point in the 6-step process, solely based on the character.

A superset of the following mapping is used:

  • \ -> :escape
  • Space -> :whitespace
  • Newline character -> :newline
  • ( -> [:ldelim :parenthesis]
  • ) -> [:rdelim :parenthesis]
  • . or : -> :maybe-fun
  • # -> :maybe-title
  • * or _ or - -> :maybe-magic
  • Everything else -> :default
In the above list, everything following an arrow is Clojure code. In Clojure brackets delimit vectors and their elements are separated by spaces. Terms that start with a colon are 'keywords'. You can think of keywords as strings for internal use (as opposed to input/output and string manipulation).
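To illustrate, the pre-tokenization stage could be implemented along these lines. This is a hedged sketch with hypothetical names (char->token-type, pre-tokenize), not the actual Termcat code:

(defn char->token-type [c]
  (case c
    \\       :escape
    \space   :whitespace
    \newline :newline
    \(       [:ldelim :parenthesis]
    \)       [:rdelim :parenthesis]
    (\. \:)  :maybe-fun
    \#       :maybe-title
    (\* \_ \-) :maybe-magic
    :default))

;; a token is a pair of a token type and a payload
(defn pre-tokenize [s]
  (for [c s] [(char->token-type c) c]))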

Notice that parentheses have a vector for a type. The same is true for [ ] brackets, { } braces, and < > chevrons, but we omit these delimiters here. Later, in the tokenization phase, we will also convert bullet items and indentation to delimiter tokens.

Also notice that some types have names starting with "maybe-". That's a gentle reminder to ourselves that these token types are rough guesses. To wit, as in Markdown, the string "# A Title" at the start of a line would be a section title, but "A#B" is never a title, even though the character # is initially always assigned the :maybe-title type.

Tokenization (2/6)

At the second stage we get more serious about tokens. To start, all successive occurrences of tokens of types :default, :whitespace, or :maybe-magic are merged. Thus the series [:default "N"] [:default "e"] [:default "a"] [:default "t"] is transformed into [:default "Neat"]. At this stage successive instances of two or more newlines are also transformed into :emptyline tokens. Everything following an escape token is given the :default token type and the escape tokens are removed. Indented blocks are wrapped in [:ldelim :indent] and [:rdelim :indent] tokens. Bullet items are wrapped in [:ldelim :bullet] and [:rdelim :bullet]. Finally, superfluous whitespace—e.g. whitespace before an [:rdelim :indent] token—is removed.
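To give a flavor of the merging step, here is a minimal sketch (hypothetical names again, not the actual implementation) that fuses runs of adjacent tokens of a mergeable type:

(def mergeable? #{:default :whitespace :maybe-magic})

(defn merge-tokens [tokens]
  (mapcat (fn [run]
            (let [[tt _] (first run)]
              (if (mergeable? tt)
                ;; fuse the whole run into a single token
                [[tt (apply str (map second run))]]
                run)))
          (partition-by first tokens)))

;; (merge-tokens [[:default "N"] [:default "e"] [:default "a"] [:default "t"]])
;; ;=> ([:default "Neat"])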

Abstract syntax tree generation (3/6)

In the first stage we converted a sequence of characters into a sequence of tokens. In the second stage we transformed this sequence of tokens into a more meaningful sequence of tokens. Notice, however, that this latter sequence implicitly describes a tree.

So far we have considered one kind of term—viz. tokens. We now add a new kind of term—blocks. Blocks consist of two delimiter tokens (one that matches the type [:ldelim _] and one that matches the type [:rdelim _]) and a sequence of terms. We refer to this sequence as the block's 'center'.

Abstract syntax tree (AST) generation is very intuitive. We simply replace all sequences of the (balanced) pattern [:ldelim _] terms* [:rdelim _] by a block term. For instance,

[:ldelim :indent] [:default "Hi"] [:whitespace " "] [:default "there"] [:rdelim :indent]
becomes
[:block [:ldelim :indent] [[:default "Hi"] [:whitespace " "] [:default "there"]] [:rdelim :indent]]
This is great because from here on we can say things such as 'an empty line, followed by an indented block, followed by another empty line' is a blockquote.
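A stack-based sketch of this transformation might look as follows. The name tokens->ast is mine, and the actual implementation may well work differently:

(defn tokens->ast [tokens]
  (letfn [(delim? [side t] (and (vector? t) (= side (first t))))]
    (peek
      (reduce (fn [stack token]
                (cond
                  ;; an ldelim opens a new frame on the stack
                  (delim? :ldelim token)
                  (conj stack [token])
                  ;; an rdelim closes the topmost frame into a block
                  (delim? :rdelim token)
                  (let [[ldelim & center] (peek stack)
                        block [:block ldelim (vec center) token]
                        stack (pop stack)]
                    (conj (pop stack) (conj (peek stack) block)))
                  ;; any other term joins the topmost frame
                  :else
                  (conj (pop stack) (conj (peek stack) token))))
              [[]]
              tokens))))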

Desugarization (4/6)

Termcat wants to have its cake and eat it too. It wants to be nice and comfy to type in, like Markdown, but it also wants to be powerful and simple like TeX. Here's how.

There's a sugar-free version of Termcat that can do everything that regular Termcat can. However, it doesn't have any of Markdown's syntax. Instead, to add a section you type ":section(This is a title)", to add emphasis you type ":emph(emphasis)", to create a bullet list you type ":bullet-list(Item 1)(Item 2)(Item 3)", and so on.

The desugarization step is responsible for translating Markdown syntax into sugar-free Termcat code. It results in sequences such as [:fun some-clojure-function] [:block [:ldelim :bullet] [...] [:rdelim :bullet]] [:block [:ldelim :bullet] [...] [:rdelim :bullet]]. Specifically, it results in :fun terms (similar to tokens, but with a Clojure function for payload) followed by zero or more :block terms, representing the arguments.

Lambda (5/6)

In the fifth stage computations are performed and HTML code is generated. Given a successive occurrence of a :fun term and one or more :block terms, the payload of the :fun term (a function) is invoked with the center of each block for arguments.

Roughly, the functions whose names start with a colon (:emph, :bullet-list, etc.) wrap their arguments in :html tokens. For example, ":emph(Test!)" is first rewritten to

[:fun colon-emph-function] [:block [:ldelim :parenthesis] [[:default "Test!"]] [:rdelim :parenthesis]]
and is then rewritten to
[:html "<emph>"] [:default "Test!"] [:html "</emph>"]

HTML output (6/6)

After all :fun terms have been evaluated we have a tree of terms of various types. In the sixth and final stage we iterate over all these terms and convert them to :html terms. For instance, the payload of :default tokens is HTML-escaped, :whitespace tokens are replaced by a space character, and :block terms are flattened.
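Sketched in Clojure, with hypothetical names and only a handful of term types handled, this final pass could look like:

(require '[clojure.string :as str])

(defn term->html [[tt payload :as term]]
  (case tt
    :html       payload
    :whitespace " "
    ;; HTML-escape the payload of :default tokens
    :default    (str/escape (str payload)
                            {\& "&amp;", \< "&lt;", \> "&gt;"})
    ;; flatten a :block term by rendering its center
    :block      (let [[_ _ center _] term]
                  (apply str (map term->html center)))
    ""))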

So that's the gist of the Termcat rewriting system. I made some simplifications and omitted some details—the elephant in the room being mathematical expressions—but I hope to have shown that rewriting systems and markup languages can make for a good match. In the next installment I will explain how I'm implementing this system in Clojure.

Friday, October 4, 2013

Termcat: A Hacker School Project

I'm currently finding myself in the privileged position of attending Hacker School. For those who don't know, Hacker School is a three-month program in New York where a bunch of motivated people practice and learn about programming.

I have decided that I will spend much of my time at Hacker School on an open-source project that I've been thinking about for a while now. I call it Termcat. This was originally a codename, but it's growing on me and I might keep it.

Termcat is, or will be, a markup language that targets HTML for output. It is inspired by Markdown and LaTeX. With Markdown it shares the ideal that source code should be easy to read. With LaTeX it shares a concern for scientific writing and typography. Like LaTeX, my intention is for Termcat to be fully programmable and to allow graphics to be generated from code and data (via D3.js). Support for MathML is a priority.

I've been using LaTeX quite extensively for the past five years. It's no doubt a very powerful system. LaTeX comes with tens of thousands of amazing packages for many niche—and not so niche—use cases. All the same, LaTeX has many flaws. Most aggravatingly, its syntax makes LaTeX source code difficult to read. As a programming environment LaTeX is best described as primitive and bizarre. Moreover, error messages in LaTeX are arcane and many packages are incompatible with one another.

Unfortunately, there's currently no serious alternative to LaTeX if you need to enter a lot of mathematical expressions. For instance, word processors require that you open an equation editor every time you want to enter an equation or mathematical symbol—insofar as they support mathematical notation at all. This makes word processors a non-starter for many scientists. There's MathML, but MathML is much too verbose to write by hand.

I believe it's possible to do much better. Consider the following LaTeX code:
$E = \{\langle a, n, n' \rangle \subseteq I \times N \times N \mid Pa \text{ and } n < n' \}$
This is rendered as follows:

E = {⟨a, n, n′⟩ ⊆ I × N × N | Pa and n < n′}

For Termcat I intend to allow identical output (albeit in HTML and MathML) to be generated from code like this:
E = {<a, n, n'> :subseteq I :times N :times N | Pa \and n < n'}
The idea is that Termcat would recognize that '=' is a binary relation. This means it would know that the terms to the left and right of it are mathematical expressions. It would also understand that the curly brackets delimit the start and end of the expression on the right. The lack of a space between the symbols '<' and 'a' would indicate that <a, n, n'> is a tuple. Finally, 'and' would be rendered as plain text because it is escaped using a backslash.

I only started hacking on Termcat this week and the above doesn't work yet. So far I have a basic parser, HTML output, and support for Markdown titles and bullet lists. For the coming week I intend to focus on some foundational work and also get started on elementary support for mathematical notation.

I plan to blog once a week for the duration of my stay at Hacker School. True, I haven't exactly been a prolific blogger in the past, but this time around there are some people I owe $5 for every week that I don't blog. :) Consequently, you can expect a series of blog posts in which I expound on the ideas behind Termcat.

For those interested, the source code can be found at https://github.com/jdevuyst/termcat.