Monday, December 16, 2013

The Command Line Interface for Termcat

Yesterday I created the long overdue command line interface (CLI) for Termcat.

Usage: java -jar termcat.jar [options] <document[.tc]>

The HTML output is stored in document.html.

  -b, --browse   Open HTML output in browser
  -w, --watch    Watch document for changes and recompile when changed
  -v, --verbose  Use verbose output

(When running from source code, substitute 'lein run' for 'java -jar termcat.jar'.)

The most exciting thing here is the --watch switch. When activated, Termcat will first compile the source document to HTML and then watch for changes in the source file. Whenever the source is modified, Termcat automatically recompiles it.

Convenience aside, --watch has two performance benefits. First, it's not necessary to start Java every time you want to recompile a Termcat document. Second, Termcat keeps a cache of intermediate results around when watching a document. This means that Termcat doesn't need to reevaluate the entire document for every minor update (although it's currently not nearly as effective at this as I reckon it could be).

I'm also happy about the --browse option. When the --browse and --watch features are combined, Termcat automatically reloads the browser every time it recompiles the Termcat document. This means you get live previews of your Termcat documents out of the box!

lein run --verbose --watch --browse blog-post.tc
Compiling... (Elapsed time: 737.621 msecs; cache size: 410 items)
HTML output stored in blog-post.html
Watching for changes...
Recompiling... (Elapsed time: 457.546 msecs; cache size: 500 items)
Recompiling... (Elapsed time: 26.785 msecs; cache size: 500 items)
Recompiling... (Elapsed time: 404.613 msecs; cache size: 502 items)
Recompiling... (Elapsed time: 17.539 msecs; cache size: 502 items)

Termcat uses Java's WatchService API for tracking document modifications. The cache is based on soft references.
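
Here's roughly what such a watch loop looks like (a minimal sketch: watch-file and recompile! are hypothetical names, and Termcat's actual code is structured differently):

(import '[java.nio.file FileSystems Paths StandardWatchEventKinds WatchEvent$Kind])

;; Minimal sketch of a watch loop. `recompile!` is a hypothetical callback.
(defn watch-file [dir-str recompile!]
  (let [watcher (.newWatchService (FileSystems/getDefault))
        dir     (Paths/get dir-str (into-array String []))]
    ;; Directories (not individual files) are registered with the service.
    (.register dir watcher
               (into-array WatchEvent$Kind [StandardWatchEventKinds/ENTRY_MODIFY]))
    (loop []
      (let [k (.take watcher)]        ;; blocks until something changes
        (doseq [ev (.pollEvents k)]
          (recompile! (.context ev))) ;; .context holds the modified Path
        (when (.reset k)              ;; the key must be reset to keep watching
          (recur))))))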

Monday, December 2, 2013

Thanksgiving

Last week was a short week at Hacker School because of the holidays. I had an amazing first Thanksgiving, but I also got some work done.

Last week I started porting my theorem prover to ClojureScript. My intention is to create a web-based front end for my theorem prover using D3.js to draw tableaux and models. I'm cautiously optimistic that this won't be too much work as my tableaux and models are nothing more than labeled DAGs.

My porting efforts came to a temporary halt when I encountered a bug in ClojureScript. This led me to take a closer look at the ClojureScript codebase—a productive distraction that resulted in a patch by yours truly.

I have also started working on a manual for Termcat.

Additionally, I have been working on a new rewriting system for Termcat that should be more powerful, easier to understand, and faster than the system that's in place now. The downside is that it has subtly different semantics and this means I can't just assume my current rules will keep working when I switch rewriting systems. I might be in for a bumpy ride!

Monday, November 25, 2013

Thoughts on the Clojure threading macros

Most programmers who have dabbled in Clojure are probably familiar with the threading macros -> and ->>. For readers who aren't, here's what they do.

The threading macros allow you to write function composition in reversed order. For instance, using the threading macros, (f (g (h x))) can be written as (-> x h g f) or (->> x h g f). Here, the first argument of the macros is evaluated as usual. The other arguments are assumed to be functions.

It's also possible to use -> and ->> for functions that expect multiple arguments. As such, the code (-> a (f b) g) is translated into (g (f a b)) and (->> a (f b) g) is rewritten as (g (f b a)). That is to say, every time the pattern (function arg1 arg2) is encountered, the result of the preceding expression is inserted as the first (->) or last (->>) argument of the function call.
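
You can check these translations in the REPL (clojure.walk/macroexpand-all expands nested macro calls too):

(require '[clojure.walk :as walk])

(walk/macroexpand-all '(-> a (f b) g))  ;; => (g (f a b))
(walk/macroexpand-all '(->> a (f b) g)) ;; => (g (f b a))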

The macro ->> is perhaps the more popular threading macro. It is very useful in combination with map, reduce, or filter—functions that take a function as their first argument and a collection as their last argument. To wit, you might write the following code:

(->> [-1 0 1 2]
     (filter pos?)
     (map inc)
     (map str))

This expression would first filter all positive elements from the vector [-1 0 1 2] (i.e. delete -1 and 0), then increment the remaining elements, and finally convert them to strings. The result is the list ("2" "3").

The macro -> is useful for functions that perform operations on their first argument. For instance,

(-> { :a [1 2] }
    (get :a)
    (conj 3))

This evaluates to the vector [1 2 3].

Invoking object or protocol methods in Clojure abides by the following syntax: (method obj arg1 arg2 ...). If, in typical functional style, your methods return updates on 'obj' then the macro -> tends to be the more useful threading macro for composing method calls.
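
For example, threading a map (the 'object') through a series of functional updates:

(require '[clojure.string :as str])

(-> {:name "termcat"}
    (assoc :lang "Clojure")
    (update-in [:name] str/upper-case))
;; => {:name "TERMCAT", :lang "Clojure"}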

Combining -> and ->> can be a bit of a pain. Nesting ->> within an -> expression works well:

(-> {:a [0 1 2 3]}
    (get :a)
    (->> (filter pos?)
         (map inc))
    reverse)
; (4 3 2)

This works because -> inserts the result of (get {:a [0 1 2 3]} :a) as the first argument of the call to ->>. Unfortunately, however, there's no obvious way of nesting -> within ->>.

Using only -> and ->> there doesn't seem to be a good general way of dealing with long chains of composed function calls that expect the 'object' in different positions.

Another annoyance is that -> and ->> don't allow for control flow in an obvious way. That is to say, you can't easily use if-statements within an -> expression.

To be fair, you could use lambdas like this

(-> ...
    ((fn [x] (if (pred x)
                 (foo x)
                 (bar x)))))

but this is a less than stellar solution if only because it introduces a lot of extra parentheses.

(Granted, using the special lambda form #() one pair of parentheses can be eliminated, but then the trade-off is that #()-expressions cannot be nested.)

Where the Clojure threading macros appear to fall short, the Swiss arrow macros purport to be a solution. However, there are rather many of them, and if you want other people to understand your code then I feel it's better not to use them.

Enter the macro as->. It's a built-in macro (as of Clojure 1.5) and it's pretty darn awesome. Here's the source code:

(defmacro as->
  "Binds name to expr, evaluates the first form in the lexical context
  of that binding, then binds name to that result, repeating for each
  successive form, returning the result of the last form."
  {:added "1.5"}
  [expr name & forms]
  `(let [~name ~expr
         ~@(interleave (repeat name) forms)]
     ~name))

That code is unusually short, don't you agree? Here's what it does. It transforms the code

(as-> [1 2 3 4] $
      (conj $ 5)
      (map dec $))

into

(let* [$ [1 2 3 4] 
       $ (conj $ 5) 
       $ (map dec $)] $)

and it returns (0 1 2 3 4). I think it's a real gem.

The macro as-> allows you to introduce control flow and it works very well with the -> macro. To wit, you can write things like

(-> ...
    (as-> x (if (pred x)
                (foo x)
                (bar x)))
    ...)

It also works great the other way around, with -> nested inside as->. For instance, you can write

(as-> [1 2 3] x
      (map inc x)
      (-> x
          vec
          (conj 5)))
; [2 3 4 5]

Isn't that just swell?

Monday, November 18, 2013

This is not a blog post

In my previous blog post I wrote that I was feeling pretty happy with my productivity. Unfortunately I didn't feel quite the same this week. It's not that I didn't do anything. I wrote plenty of code. However, most of my time was spent experimenting with new syntax and getting lots of fairly boring details right. This makes it feel like I didn't do anything 'real'.

Having written a PhD thesis, however, I know all too well that feeling productive and actually being productive are things that don't always correlate. After digging through my notes and my GitHub commits I managed to compile this (non-exhaustive) list of stuff I did since my last post:

  • I made sure 'weird' expressions such as p^y_a^z'_b compile to sensible MathML code. LaTeX will actually give you an error if you type that; Termcat interprets it as p^{yz}'_{ab}. Incidentally, the MathML for that is <msubsup><mi>p</mi><mrow><mo>{</mo><mi>a</mi><mi>b</mi><mo>}</mo></mrow><mrow><mo>{</mo><mi>y</mi><mi>z</mi><mo>}</mo><mo>′</mo></mrow></msubsup>. Talk about verbose!
  • I ripped out some features that were dear to me, such as the ability to use '<' and '>' to delimit tuples. This, I realized, was confusing in many contexts. (One feature I did keep is that the Termcat code <~ ... ~>  converts the delimiters to proper chevron glyphs.)
  • I made many parsing rules a lot stricter. This means there's less chance of accidentally triggering special Termcat features.
  • I improved the typography of Termcat documents.
  • I experimented with syntax for bound names and for lambdas.
  • I read up on finite state automata (which are used to match regular expressions) and finite state transducers (which can do substitutions). One outcome of this is the conjecture that if I change the semantics of my rewriting system somewhat then my rewriting rules may become composable. This means that instead of having to do a pass over all tokens for every rule, it may be possible to automatically combine all rules first and then run them all at once in a single pass. I already alluded to this in an earlier blog post and I might do some work on this this week.
So there. Maybe I wasn't so unproductive after all!

Thursday, November 7, 2013

Halfway

I'm halfway through Hacker School and Termcat is reaching a point where I'm thinking of demoing it internally. I will probably do so next week.

Here are some of the features I implemented the past few days:
  • Added lexically scoped variables.
  • Added support for selecting a calligraphic math font. The Termcat code +M is equivalent to the LaTeX code \mathcal{M}.
  • It's now possible to use HTML tags as well as numerical and named entities in Termcat without escaping them. They 'fall through' and are outputted as raw HTML.
  • Added special support for the unary minus operator: -x is syntactic sugar for -~ x. In contrast, for the binary operator the notation x ~-~ y is used (or x - y after setting up the necessary bindings).
  • Like in LaTeX, whitespace is now wider if it follows a colon or full stop (but not an acronym).
  • It's now possible to write hé as h%'e and nǐ hǎo as n%3i h%3ao. (I should learn more about Pinyin and figure out if it would be possible to convert the standard notation ni3 hao3 automatically into nǐ hǎo!)
  • Arbitrarily wide spaces can be inserted using the notation %_1.2em. Manual kerning is possible using the notation %+2px or %-.5ex.
  • Added support for superscript (x^y), subscript (x_y), and fractions (x|y) in math mode. Similar to italics and other decorators, the outermost brackets in an expression like (1 + x)|(2 - y) are removed, so as to allow for spaces in the numerator and denominator.
I'm quite happy with my productivity as of late. It seems the foundations I created are holding up fairly well, and this means I can add new features easily.

Next week I will be working on adding user-defined functions (lambdas) to Termcat. That should be fun!

Monday, November 4, 2013

How Termcat Parses Mathematical Expressions

In my first blog post on Termcat I explained that one of my primary goals was to create a markup language that has a more natural syntax for writing mathematical expressions than LaTeX.

I mentioned the following expression as an example.
In LaTeX the code for this expression looks like this:
$E = \{\langle a, n, n' \rangle \subseteq I \times N \times N \mid Pa \text{ and } n < n' \}$
I set myself the goal to allow the same expression to be generated from the following code:
E = {<a, n, n'> :subseteq I :times N :times N | Pa \and n < n'}
I mostly have something like this (but better) working!

A heuristic approach

One of the driving ideas behind Termcat's syntax is that when it recognizes an infix operator, it should know that, normally, there's a mathematical expression to the left and to the right of that operator. It should also be able to make similar inferences from prefix and suffix operators.

By way of example, the 'raw' syntax for operators is as follows:
~=~ : infix operator =, automatic spacing
~~|~~ : infix operator |, forces normal spacing
~~! : postfix operator !, normal spacing to the left
#~~~ : prefix operator #, wide spacing to the right
Operators must be surrounded by whitespace on the side of the tildes.

Using this syntax the expression above can be encoded like this:
E ~=~ {~ ⟨~ a ~,~ n ~,~ n' ~⟩ ~⊆~ I ~×~ N ~×~ N ~~|~~ Pa and n ~<~ n' ~}
(There's magic involved in getting n' to display correctly, but let's ignore that. Also, MathML doesn't seem to define default spacing for '|' so it needs to be surrounded by double tildes.)

Termcat also has heuristics for parentheses, brackets, braces, and chevrons, and this allows us to drop some of the tildes:
E ~=~ {<a ~,~ n ~,~ n'> ~⊆~ I ~×~ N ~×~ N ~~|~~ Pa and n ~<~ n'}
The output is nearly identical to what LaTeX generates.
The Termcat code can be simplified further. First, however, I need to introduce another Termcat feature.

Intermezzo: lexical bindings

I'm currently working on adding user-defined 'bindings' or substitutions of standalone words to Termcat. The idea is that
!bind(test)(*test*)
test
should be rewritten into
*test*
Bindings are lexically scoped, where scope is delimited by parentheses, brackets, braces, chevrons, indentation, and bullet list syntax. Hence
(!bind(test)(*test*)
test)

test
is rewritten into
*test*

test
Towards a more natural syntax for mathematical expressions

Bindings can be used to remove the remaining tildes in the Termcat code. Consider the following declarations:
!bind
- =
- ~=~
!bind
- ,
- ~,~
!bind
- subseteq
- ~⊆~
!bind
- *
- ~×~
!bind
- |
- ~~|~~
!bind
- \<
- ~<~
For now the idea is that these bindings have to be set in every Termcat document. I may add a default bindings table at a later point though. In any case, after the above bindings have been defined it should be possible to write the following code:
E = {<a , n , n'> subseteq I * N * N | Pa and n < n'}
That looks a lot more readable than the LaTeX code if you ask me! In fact, I think it's nicer than the syntax I originally envisioned.

One obvious further improvement might be to treat commas (and semicolons) as special by default. This would obviate the need to surround commas by whitespace. I will look into this too.

Monday, October 28, 2013

Termcat: Random Remarks

Current Feature-Set

Here's what Termcat can currently do.

Words can be emphasized by writing *italics* for italics. Sentences or sentence fragments can be emphasized *(like this)*. The outermost parentheses are automatically removed: like this. There's also syntax for **bold** and _underlined_ text.

Blockquotes are created by indenting text using two spaces.

Bullet lists are created like this:

- This is notation
- for bullet
  list.

As demonstrated above, bullet items can be continued on a new line by indenting those subsequent lines with two spaces. Bullet lists can also be nested and bullet items may contain blockquotes or multiple paragraphs.

Numbered lists are created like this:

:numbered-list
- item 1
- item 2

Termcat actually interprets this as the (curried) function call :numbered-list(item 1)(item 2). In fact, all syntactic sugar is translated into such function calls. For example, *this* is actually rewritten as :emph(this).

Fun fact. A sentence fragment can also be italicized by writing this:

:emph
  this is a function argument, not a blockquote

or

:emph
- this is not a real bullet item

The reason for this is that bullet items, indented blocks, as well as text in parentheses or other brackets are internally represented as the same thing. Hence writing :foo(bar)(baz), :foo[bar]{baz}, or

:foo
- bar
- baz

all amount to the same thing.

Links are created using the syntax [Text](http://example.com).

Images are created like this: ![Alt text](example.jpg)

Finally, there's support for creating section titles:

# This is a section title (h1)
## This is a subsection title (h2)

Planned Features

I have implemented rudimentary support for mathematical equations. There are still a lot of details to work out, though, and I hope to finish this work by the end of this week.

I have also started thinking about a Termcat-JavaScript interface. Specifically, I'd love it if existing JavaScript components could be reused in Termcat. I'm not sure what the interface for that should look like, though.

Zach, one of the Hacker School facilitators, pointed me to x-tags (which is inspired by web components) and this might be one way out. If I dictated that JavaScript code should be called via x-tags then I wouldn't have to add much more to Termcat than the ability to output arbitrary HTML tags. Nevertheless, I'm not sure x-tags is the answer. For instance, I think that x-tags / web components syntax is not well-suited for describing graphs, whereas in LaTeX graphs can be somewhat elegantly described using PGF/TikZ.

Related Work

In my previous blog post I mentioned that there seemed to be a link between my rewriting rules and finite state automata. To that I want to add that there are also clear links to concatenative programming in that the input and output of every rewriting rule solely consists of pop and push operations on a stack. (Kudos to Vlad for that link.)

David, another Hacker School facilitator, pointed me to an interesting research and pedagogic program under the name 'nanopass compiler'. The idea in this program is to construct compilers out of many passes that each perform one kind of simple transformation. Sounds like a rewriting system to me! The big difference with the Termcat rewriting system seems to be that in a nanopass compiler you construct a formal grammar for every intermediate language, whereas I don't have a formal grammar at any point.

Friday, October 18, 2013

Termcat: Details On the Current Implementation and the Potential for Optimization

In my previous blog post I gave a general overview of the rewriting system behind Termcat. Below I explain how this system is implemented in Clojure. I also mention some potential optimizations.

This blog post assumes basic familiarity with Clojure and Termcat.

The General Case

Recall that rewriting rules transform trees of tokens into trees of tokens. Equivalently, taking a 'term' to be a token or a sequence of tokens, they transform sequences of terms into sequences of terms.

In Clojure we represent rewriting rules by functions that take the following arguments:
  • An object state
  • A sequence result
  • A term v
Rules return pairs of the following elements:
  • An object new-state
  • An update sequence new-result
Rules have the following meta-data (meaning, things you need to know to apply the rule):
  • An integer padding-count
  • An object init-state
When we run (rewrite terms rule) this happens (see the sketch after this list):
  • Every term t in terms that is a sequence is replaced by the result of (rewrite t rule). (Recursion ends when terms is a sequence of tokens.)
  • Next, we call (rule init-state (repeat padding-count nil) (first terms))
  • The function rule is called once for all other terms in terms (both tokens and sequences), where the state and result passed to each call is what the previous call returned
  • Finally, rule is called padding-count more times with nil passed for the last element
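
Putting the above together, here's a minimal sketch of the driver. It assumes that padding-count and init-state are stored as Clojure metadata on the rule function and that tokens can be told apart from term sequences with a token? predicate; the actual implementation differs in these details.

;; Hypothetical representation for this sketch: tokens are maps,
;; nested term sequences are seqs/vectors.
(defn token? [term]
  (map? term))

;; Sketch of the rewrite driver described above.
(defn rewrite [terms rule]
  (let [{:keys [padding-count init-state]} (meta rule)
        ;; recursively rewrite nested term sequences first
        terms (map #(if (token? %) % (rewrite % rule)) terms)]
    (second
     (reduce (fn [[state result] v]
               (rule state result v))
             [init-state (repeat padding-count nil)]
             ;; the trailing nils make rule fire padding-count extra times
             (concat terms (repeat padding-count nil))))))
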
Interestingly, all but two rules fit the following simple pattern:
  • They do pattern matching on the term or token type (see my previous post) of v and of the last padding-count elements of result
  • The new result that they return is the old result, minus possibly its padding-count last elements, plus a number of new terms
The rule that converts the token sequence into an abstract syntax tree is the most notable rule that does not adhere to this pattern. It needs to be able to pop any number of elements from the accumulated result.

The other rule that violates the above pattern is the rule for evaluating functions. This is because this rule may need to pop any number of function arguments from the accumulated result. However, this is merely a matter of convenience. By using currying we could make this case conform to the aforementioned pattern.

A Macro For the Simple Pattern

Termcat uses a macro defrule to deal with the simple pattern. Using this macro a rule might look like this:

(defrule remove-superfluous-whitespace
  [state t1 t2]
  tt
  [_ (:or nil
          :emptyline
          [:block :indent]
          [:block :bullet]) :whitespace] [nil t1]
  [_ :whitespace (:or nil
                      :emptyline
                      [:block :indent]
                      [:block :bullet])] [nil t2])

The first argument assigns names to the state and the tokens that the rule wants to read. Implicitly, it thereby also specifies the number of tokens it wants to read and the desired padding-count (the number of tokens, minus 1). Here, it reads one element t1 from result. The second element t2 corresponds to v.

The second argument, tt, specifies that we want to match on the term-type of t1 and t2. The remaining arguments do pattern matching (using core.match) on state, (tt t1), and (tt t2).

A rule defined with defrule is expected to return a vector of the new state and the tokens that should replace the tokens it read. Alternatively, nil can be returned to indicate that nothing should change. This is also what happens if none of the patterns match the input.

Finite State Machines

The rules we discussed are simple enough that the matching part of the rules can be transformed into regular expressions or finite state machines. The difficulty lies in efficiently handling the case where the rules make substitutions.

In essence, transforming a Termcat document into an HTML document is a matter of applying a bunch of rules sequentially. That is, we're faced with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite tokenizer-rule1)
    (rewrite tokenizer-rule2)
    ...
    (rewrite generate-abstract-syntax-tree)
    (rewrite post-ast-rule1)
    (rewrite post-ast-rule2)
    ...
    (write-html-document))

It would be interesting to investigate whether or not the different rewriting rules could be automatically—using macros—merged into a single finite state machine. Ideally we would end up with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite merged-tokenizer-rules)
    (rewrite generate-abstract-syntax-tree)
    (rewrite merged-post-ast-rules)
    (write-html-document))

Parallelism and Fast Recompilation

All of the rewriting rules (including the non-simple ones) are deterministic. Their output depends only on the following factors: (i) the state, (ii) the tokens that they read from result, and (iii) v.

This is incredibly exciting! Here's why: Parallelism and Fast Recompilation.

Suppose we want to rewrite a dozen tokens using a rule rule1. The first token has the payload (or content) '1', the second token has the payload '2', and so on. What we could do is apply rule1 in parallel to (i) the complete sequence of 12 tokens and (ii) the sequence of tokens 7-12. These two processes are depicted below.
The coloring indicates which calls to rule1 have identical input (meaning the tokens they read from result, their state, and v are identical).

We can see that for token 7 the input to rule1 is different for both processes. This could be because in the first process rule1 rewrote token 5 or 6. It could also be because the state is different for both function calls. Here we should discard the output of the second process. When the first process arrives at the eighth token, however, we see that its arguments for rule1 are identical to the arguments that the second process used. This is great. This means we can terminate the first process and append the result of the second process to the result of the first process (discarding those tokens that were emitted after the second process rewrote token 7 and that were not read in by rule1 upon rewriting token 8).

Similar optimization should be possible when the source code of a Termcat document is modified. Think fast (partial) recompilation.

Three caveats:

  1. Coordinating this sort of parallelism is inherently tricky. For small inputs it would definitely not be worth it. And what's 'small' in this context?
  2. Moreover, it's not clear if the overhead of comparing the arguments for rule1 would, in practice, ever be worth the time saved by not having to do redundant rewriting.
  3. This optimization is not applicable to the rule for generating abstract syntax trees since this rule (i) tends to keep a lot of state around and (ii) may need to read an arbitrary number of elements from result.
Note that much of the last two sections is speculation. I might very well dig deeper into these things at some point because I think that would be a lot of fun. Nevertheless, this will probably have to wait until after Hacker School since I want to focus on more basic features first.

Meanwhile, I'm approaching a point where I want to start demoing Termcat to people and get feedback. In my next blog post I will give an overview of the features currently implemented.

Saturday, October 12, 2013

The Termcat Rewriting System and Its Different Stages

In my previous blog post I described some of the syntax of Termcat, a markup language that I have been working on while attending Hacker School. In this post I'd like to outline how I'm implementing it.

Termcat documents can be thought of as a sequence of Termcat expressions. Compiling Termcat documents into HTML involves applying a series of transformations to these expressions until every one of them is reduced to a piece of HTML code. The concatenation of this HTML code, wrapped in some leading and trailing code, constitutes the output document.

When I say 'applying transformations' I'm being very literal. Everything after reading the file and before mixing in the leading and trailing code is implemented as a rewriting system. There are six stages to this process. The purpose of this blog post is to explain them one by one.

Pre-tokenization (1/6)

The Termcat document is initially loaded as a sequence of Unicode characters. Next, every one of these characters is mapped onto a 'token'. Think of a token as a pair consisting of (i) a token type and (ii) a payload. The payload is the character. The token type is, at this point in the 6-step process, solely based on the character.

A superset of the following mapping is used:

  • \ -> :escape
  • Space -> :whitespace
  • Newline character -> :newline
  • ( -> [:ldelim :parenthesis]
  • ) -> [:rdelim :parenthesis]
  • . or : -> :maybe-fun
  • # -> :maybe-title
  • * or _ or - -> :maybe-magic
  • Everything else -> :default
In the above list, everything following an arrow is Clojure code. In Clojure brackets delimit vectors and their elements are separated by spaces. Terms following colons are 'keywords'. You can think of keywords as strings for internal use (as opposed to input/output and string manipulation).

Notice that parentheses have a vector for a type. The same is true for [ ] brackets, { } braces, and < > chevrons, but we omit these delimiters here. Later, in the tokenization phase, we will also convert bullet items and indentation to delimiter tokens.

Also notice that some types are named "maybe-". That's a gentle reminder to ourselves that these token types are rough guesses. To wit, as in Markdown the string "# A Title" at the start of a line would be a section title, but "A#B" is never a title, even though the character # is initially always assigned the :maybe-title type.
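
As a sketch, this stage amounts to little more than a case expression over characters (pre-tokenize is a hypothetical name and the actual code covers more characters):

(defn pre-tokenize [s]
  (for [c s]
    [(case c
       \\ :escape
       \space :whitespace
       \newline :newline
       \( [:ldelim :parenthesis]
       \) [:rdelim :parenthesis]
       (\. \:) :maybe-fun
       \# :maybe-title
       (\* \_ \-) :maybe-magic
       :default)
     (str c)]))

;; (pre-tokenize "Hi") => ([:default "H"] [:default "i"])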

Tokenization (2/6)

At the second stage we get more serious about tokens. To start, all successive occurrences of tokens of types :default, :whitespace, or :maybe-magic are merged. Thus the series [:default "N"] [:default "e"] [:default "a"] [:default "t"] is transformed into [:default "Neat"]. At this stage successive instances of two or more newlines are also transformed into :emptyline tokens. Everything following an escape token is given the :default token type and the escape tokens are removed. Indented blocks are wrapped in [:ldelim :indent] and [:rdelim :indent] tokens. Bullet items are wrapped in [:ldelim :bullet] and [:rdelim :bullet]. Finally, superfluous whitespace—e.g. whitespace before an [:rdelim :indent] token—is removed.
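
The merging step can be sketched with partition-by (a simplification of the actual rule-based implementation):

;; Merge successive :default, :whitespace, and :maybe-magic tokens.
(defn merge-tokens [tokens]
  (->> tokens
       (partition-by first)
       (mapcat (fn [[[tt] :as run]]
                 (if (#{:default :whitespace :maybe-magic} tt)
                   [[tt (apply str (map second run))]]
                   run)))))

;; (merge-tokens [[:default "N"] [:default "e"] [:default "a"] [:default "t"]])
;; => ([:default "Neat"])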

Abstract syntax tree generation (3/6)

In the first stage we converted a sequence of characters into a sequence of tokens. In the second stage we transformed this sequence of tokens into a more meaningful sequence of tokens. Notice, however, that this latter sequence implicitly describes a tree.

So far we have considered one kind of term—viz. tokens. We now add a new kind of term—blocks. Blocks consist of two delimiter tokens (one that matches the type [:ldelim _] and one that matches the type [:rdelim _]) and a sequence of terms. We refer to this sequence as the block's 'center'.

Abstract syntax tree (AST) generation is very intuitive. We simply replace all sequences of the (balanced) pattern [:ldelim _] terms* [:rdelim _] by a block term. For instance,

[:ldelim :indent] [:default "Hi"] [:whitespace " "] [:default "there"] [:rdelim :indent]
becomes
[:block [:ldelim :indent] [[:default "Hi"] [:whitespace " "] [:default "there"]] [:rdelim :indent]]
This is great because from here on we can say things such as 'an empty line, followed by an indented block, followed by another empty line' is a blockquote.
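
Generating the tree can be sketched as a fold over the tokens with a stack of open blocks (build-ast is a hypothetical name; this assumes balanced delimiters, whereas the real code is more careful):

;; Fold tokens over a stack of [ldelim center] frames.
(defn build-ast [tokens]
  (->> tokens
       (reduce (fn [stack t]
                 (case (first t)
                   :ldelim (conj stack [t []]) ;; open a new block
                   :rdelim (let [[l center] (peek stack) ;; close a block
                                 block [:block l center t]
                                 [pl pc] (peek (pop stack))]
                             (conj (pop (pop stack)) [pl (conj pc block)]))
                   ;; ordinary token: append to the innermost open block
                   (let [[l c] (peek stack)]
                     (conj (pop stack) [l (conj c t)]))))
               [[nil []]])
       peek
       second))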

Desugarization (4/6)

Termcat wants to have its cake and eat it too. It wants to be nice and comfy to type in, like Markdown, but it also wants to be powerful and simple like TeX. Here's how.

There's a sugar-free version of Termcat that can do everything that regular Termcat can. However, it doesn't have any of Markdown's syntax. Instead, to add a section you would type ":section(This is a title)", to add emphasis you would type ":emph(emphasis)", and to create a bullet list you would type ":bullet-list(Item 1)(Item 2)(Item 3)", and so on.

The desugarization step is responsible for translating Markdown syntax into sugar-free Termcat code. It results in sequences such as [:fun some-clojure-function] [:block [:ldelim :bullet] [...] [:rdelim :bullet]] [:block [:ldelim :bullet] [...] [:rdelim :bullet]]. Specifically, it results in :fun terms (similar to tokens, but with a Clojure function for payload) followed by zero or more :block terms, representing the arguments.

Lambda (5/6)

In the fifth stage computations are performed and HTML code is generated. Given a successive occurrence of a :fun term and one or more :block terms, the payload of the :fun term (a function) is invoked with the center of each block for arguments.

Roughly, the functions whose names start with a colon (:emph, :bullet-list, etc.) wrap their arguments in :html tokens. For example, ":emph(Test!)" is first rewritten to

[:fun colon-emph-function] [:block [:ldelim :parenthesis] [[:default "Test!"]] [:rdelim :parenthesis]]
and is then rewritten to
[:html "<emph>"] [:default "Test!"] [:html "</emph>"]

HTML output (6/6)

After all :fun terms have been evaluated we have a tree of terms of various types. In the sixth and final stage we iterate over all these terms and convert them to :html terms. For instance, the payload of :default tokens is HTML-escaped, :whitespace tokens are replaced by a space character, and :block terms are flattened.
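
A sketch of this final pass (illustrative only; the actual token types and escaping rules are more involved):

(require '[clojure.string :as str])

(defn term->html [t]
  (case (first t)
    :html       (second t)
    :default    (str/escape (second t) {\< "&lt;", \> "&gt;", \& "&amp;"})
    :whitespace " "
    :block      (apply str (map term->html (nth t 2))) ;; flatten the center
    ""))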

So that's the gist of the Termcat rewriting system. I made some simplifications and omitted some details—the elephant in the room being mathematical expressions—but I hope to have shown that rewriting systems and markup languages can make for a good match. In the next installment I will explain how I'm implementing this system in Clojure.

Friday, October 4, 2013

Termcat: A Hacker School Project

I'm currently finding myself in the privileged position of attending Hacker School. For those who don't know, Hacker School is a three-month program in New York where a bunch of motivated people practice and learn about programming.

I have decided that I will spend much of my time at Hacker School on an open-source project that I've been thinking about for a while now. I call it Termcat. This was originally a codename, but it's growing on me and I might keep it.

Termcat is, or will be, a markup language that targets HTML for output. It is inspired by Markdown and LaTeX. With Markdown it shares the ideal that source code should be easy to read. With LaTeX it shares a concern for scientific writing and typography. As with LaTeX, my intention is for Termcat to be fully programmable and to allow graphics to be generated from code and data (via D3.js). Support for MathML is a priority.

I've been using LaTeX quite extensively for the past five years. It's no doubt a very powerful system. LaTeX comes with tens of thousands of amazing packages for many niche—and not so niche—use cases. All the same, LaTeX has many flaws. Most aggravatingly, its syntax makes LaTeX source code difficult to read. As a programming environment LaTeX is best described as primitive and bizarre. Moreover, error messages in LaTeX are arcane and many packages are incompatible with one another.

Unfortunately, there's currently no serious alternative to LaTeX if you need to enter a lot of mathematical expressions. For instance, word processors require that you open an equation editor every time you want to enter an equation or mathematical symbol—insofar as they support mathematical notation at all. This makes word processors a non-starter for many scientists. There's MathML, but MathML is much too verbose to write by hand.

I believe it's possible to do much better. Consider the following LaTeX code:
$E = \{\langle a, n, n' \rangle \subseteq I \times N \times N \mid Pa \text{ and } n < n' \}$
This is rendered as a nicely typeset formula.
For Termcat I intend to allow identical output (albeit in HTML and MathML) to be generated from code like this:
E = {<a, n, n'> :subseteq I :times N :times N | Pa \and n < n'}
The idea is that Termcat would recognize that '=' is a binary relation. This means it would know that the terms to the left and right of it are mathematical expressions. It would also understand that the curly brackets delimit the start and end of the expression on the right. The lack of a space between the symbols '<' and 'a' would indicate that <a, n, n'> is a tuple. Finally, 'and' would be rendered as plain text because it is escaped using a backslash.

I only started hacking on Termcat this week and the above doesn't work yet. So far I have a basic parser, HTML output, and support for Markdown titles and bullet lists. For the coming week I intend to focus on some foundational work and also get started on elementary support for mathematical notation.

I plan to blog once a week for the duration of my stay at Hacker School. True, I haven't exactly been a prolific blogger in the past, but this time around there are some people I owe $5 for every week that I don't blog. :) Consequently, you can expect a series of blog posts in which I expound on the ideas behind Termcat.

For those interested, the source code can be found at https://github.com/jdevuyst/termcat.