Monday, November 25, 2013

Thoughts on the Clojure threading macros

Most programmers who have dabbled in Clojure are probably familiar with the threading macros -> and ->>. For readers who aren't, here's what they do.

The threading macros allow you to write function composition in reversed order. For instance, using the threading macros, (f (g (h x))) can be written as (-> x h g f) or (->> x h g f). Here, the first argument of the macro is evaluated as usual. The other arguments are assumed to be functions.

It's also possible to use -> and ->> with functions that expect multiple arguments. For example, the code (-> a (f b) g) is translated into (g (f a b)) and (->> a (f b) g) is rewritten as (g (f b a)). That is to say, every time a pattern like (function arg1 arg2) is encountered, the result of the preceding expression is inserted as the first (->) or last (->>) argument of the function call.
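You can verify these expansions in a REPL with macroexpand (the symbols a, b, f, and g are just placeholders here):

(macroexpand '(-> a (f b) g))
; (g (f a b))

(macroexpand '(->> a (f b) g))
; (g (f b a))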

The macro ->> is perhaps the more popular threading macro. It is very useful in combination with map, reduce, or filter—functions that take a function as their first argument and a collection as their last argument. To wit, you might write the following code:

(->> [-1 0 1 2]
     (filter pos?)
     (map inc)
     (map str))

This expression first filters out the non-positive elements of the vector [-1 0 1 2] (i.e. it drops -1 and 0), then increments the remaining elements, and finally converts them to strings. The result is the list ("2" "3").

The macro -> is useful for functions that perform operations on their first argument. For instance,

(-> { :a [1 2] }
    (get :a)
    (conj 3))

This evaluates to the vector [1 2 3].

Invoking object or protocol methods in Clojure abides by the following syntax: (method obj arg1 arg2 ...). If, in typical functional style, your methods return updates on 'obj' then the macro -> tends to be the more useful threading macro for composing method calls.
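To make that concrete, here's a minimal sketch; the Stack protocol and its vector implementation below are made up purely for illustration:

(defprotocol Stack
  (push-item [this x])
  (pop-item [this]))

(extend-protocol Stack
  clojure.lang.PersistentVector
  ; both methods return an updated 'object', so they chain nicely with ->
  (push-item [this x] (conj this x))
  (pop-item [this] (pop this)))

(-> []
    (push-item 1)
    (push-item 2)
    (pop-item))
; [1]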

Combining -> and ->> can be a bit of a pain. Nesting ->> within an -> expression works well:

(-> {:a [0 1 2 3]}
    (get :a)
    (->> (filter pos?)
         (map inc))
    reverse)
; (4 3 2)

This works because -> inserts the result of (get {:a [0 1 2 3]} :a) as the first argument of the call to ->>. Unfortunately, however, there's no obvious way to nest -> within ->>.

Using only -> and ->> there doesn't seem to be a good general way of dealing with long chains of composed function calls that expect the 'object' in different positions.

Another annoyance is that -> and ->> don't allow for control flow in an obvious way. That is to say, you can't easily use an if expression within a -> form.

To be fair, you could use a lambda, like this:

(-> ...
    ((fn [x] (if (pred x)
                 (foo x)
                 (bar x)))))

but this is a less than stellar solution if only because it introduces a lot of extra parentheses.

(Granted, using the special lambda form #() one pair of parentheses can be eliminated, but then the trade-off is that #()-expressions cannot be nested.)
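For what it's worth, here's what the #() variant might look like, with inc, pos?, and str standing in for the hypothetical pred, foo, and bar:

(-> -3
    inc
    (#(if (pos? %)
        (str "positive: " %)
        (str "non-positive: " %))))
; "non-positive: -2"

Note that the anonymous function still has to be wrapped in parentheses so that -> threads the value into a call of that function.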

Where the Clojure threading macros appear to fall short, the Swiss arrow macros purport to be a solution. However, there are rather many of them, and if you want other people to understand your code then I feel it's better not to use them.

Enter the macro as->. It's a built-in macro (as of Clojure 1.5) and it's pretty darn awesome. Here's the source code:

(defmacro as->
  "Binds name to expr, evaluates the first form in the lexical context
  of that binding, then binds name to that result, repeating for each
  successive form, returning the result of the last form."
  {:added "1.5"}
  [expr name & forms]
  `(let [~name ~expr
         ~@(interleave (repeat name) forms)]
     ~name))

That code is unusually short, don't you agree? Here's what it does. It transforms the code

(as-> [1 2 3 4] $
      (conj $ 5)
      (map dec $))

into

(let* [$ [1 2 3 4] 
       $ (conj $ 5) 
       $ (map dec $)] $)

and it returns (0 1 2 3 4). I think it's a real gem.
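Because you pick the name yourself, the threaded value can appear in any argument position of each form, which is exactly what -> and ->> couldn't offer. A small sketch with arbitrary values:

(as-> 10 n
      (+ 5 n)       ; 15
      (range 0 n 3) ; n sits in the middle argument position
      (vec n))
; [0 3 6 9 12]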

The macro as-> allows you to introduce control flow, and it nests nicely inside a -> expression. To wit, you can write things like

(-> ...
    (as-> x (if (pred x)
                (foo x)
                (bar x)))
    ...)
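A concrete variant of that pattern, with made-up steps just to show it evaluates:

(-> {:a [1 2 3]}
    (get :a)
    (as-> xs (if (every? pos? xs)
               (map inc xs)
               xs))
    vec)
; [2 3 4]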

It also works the other way around, with -> nested inside as->. For instance, you can write

(as-> [1 2 3] x
      (map inc x)
      (-> x
          vec
          (conj 5)))
; [2 3 4 5]

Isn't that just swell?

Monday, November 18, 2013

This is not a blog post

In my previous blog post I wrote that I was feeling pretty happy with my productivity. Unfortunately I didn't feel quite the same this week. It's not that I didn't do anything. I wrote plenty of code. However, most of my time was spent experimenting with new syntax and getting lots of fairly boring details right. This makes it feel like I didn't do anything 'real'.

Having written a PhD thesis, however, I know all too well that feeling productive and actually being productive are things that don't always correlate. After digging through my notes and my GitHub commits I managed to compile this (non-exhaustive) list of stuff I did since my last post:

  • I made sure 'weird' expressions such as p^y_a^z'_b compile to sensible MathML code. LaTeX will actually give you an error if you type that; Termcat interprets it as p^{yz}'_{ab}. Incidentally, the MathML for that is <msubsup><mi>p</mi><mrow><mo>{</mo><mi>a</mi><mi>b</mi><mo>}</mo></mrow><mrow><mo>{</mo><mi>y</mi><mi>z</mi><mo>}</mo><mo>′</mo></mrow></msubsup>. Talk about verbose!
  • I ripped out some features that were dear to me, such as the ability to use '<' and '>' to delimit tuples. This, I realized, was confusing in many contexts. (One feature I did keep is that the Termcat code <~ ... ~>  converts the delimiters to proper chevron glyphs.)
  • I made many parsing rules a lot stricter. This means there's less chance of accidentally triggering special Termcat features.
  • I improved the typography of Termcat documents.
  • I experimented with syntax for bound names and for lambdas.
  • I read up on finite state automata (which are used to match regular expressions) and finite state transducers (which can do substitutions). One outcome of this is the conjecture that if I change the semantics of my rewriting system somewhat then my rewriting rules may become composable. This means that instead of having to do a pass over all tokens for every rule, it may be possible to automatically combine all rules first and then run them all at once in a single pass. I already alluded to this in an earlier blog post and I might do some work on this this week.
So there. Maybe I wasn't so unproductive after all!

Thursday, November 7, 2013

Halfway

I'm halfway through Hacker School and Termcat is reaching a point where I'm thinking of demoing it internally. I will probably do so next week.

Here are some of the features I implemented the past few days:
  • Added lexically scoped variables.
  • Added support for selecting a calligraphic math font. The Termcat code +M is equivalent to the LaTeX code \mathcal{M}.
  • It's now possible to use HTML tags as well as numerical and named entities in Termcat without escaping them. They 'fall through' and are outputted as raw HTML.
  • Added special support for the unary minus operator: -x is syntactic sugar for -~ x. In contrast, for the binary operator the notation x ~-~ y is used (or x - y after setting up the necessary bindings).
  • Like in LaTeX, whitespace is now wider if it follows a colon or full stop (but not an acronym).
  • It's now possible to write hé as h%'e and nǐ hǎo as n%3i h%3ao. (I should learn more about Pinyin and figure out if it would be possible to convert the standard notation ni3 hao3 automatically into nǐ hǎo!)
  • Arbitrarily wide spaces can be inserted using the notation %_1.2em. Manual kerning is possible using the notation %+2px or %-.5ex.
  • Added support for superscript (x^y), subscript (x_y), and fractions (x|y) in math mode. Similar to italics and other decorators, the outermost brackets in an expression like (1 + x)|(2 - y) are removed, so as to allow for spaces in the numerator and denominator.
I'm quite happy with my productivity as of late. It seems the foundations I created are holding up fairly well, and this means I can add new features easily.

Next week I will be working on adding user-defined functions (lambdas) to Termcat. That should be fun!

Monday, November 4, 2013

How Termcat Parses Mathematical Expressions

In my first blog post on Termcat I explained that one of my primary goals was to create a markup language that has a more natural syntax for writing mathematical expressions than LaTeX.

As an example, I mentioned an expression that, written in LaTeX, looks like this:
$E = \{\langle a, n, n' \rangle \subseteq I \times N \times N \mid Pa \text{ and } n < n' \}$
I set myself the goal to allow the same expression to be generated from the following code:
E = {<a, n, n'> :subseteq I :times N :times N | Pa \and n < n'}
I mostly have something like this (but better) working!

A heuristic approach

One of the driving ideas behind Termcat's syntax is that when it recognizes an infix operator, it should know that, normally, there's a mathematical expression to the left and to the right of that operator. It should also be able to make similar inferences from prefix and postfix operators.

By way of example, the 'raw' syntax for operators is as follows:
~=~ : infix operator =, automatic spacing
~~|~~ : infix operator |, forces normal spacing
~~! : postfix operator !, normal spacing to the left
#~~~ : prefix operator #, wide spacing to the right
Operators must be surrounded by whitespace on the side of the tildes.

Using this syntax the expression above can be encoded like this:
E ~=~ {~ ⟨~ a ~,~ n ~,~ n' ~⟩ ~⊆~ I ~×~ N ~×~ N ~~|~~ Pa and n ~<~ n' ~}
(There's magic involved in getting n' to display correctly, but let's ignore that. Also, MathML doesn't seem to define default spacing for '|' so it needs to be surrounded by double tildes.)

Termcat also has heuristics for parentheses, brackets, braces, and chevrons, and this allows us to drop some of the tildes:
E ~=~ {<a ~,~ n ~,~ n'> ~⊆~ I ~×~ N ~×~ N ~~|~~ Pa and n ~<~ n'}
The output is nearly identical to what LaTeX generates. The Termcat code can be simplified further. First, however, I need to introduce another Termcat feature.

Intermezzo: lexical bindings

I'm currently working on adding user-defined 'bindings' or substitutions of standalone words to Termcat. The idea is that
!bind(test)(*test*)
test
should be rewritten into
*test*
Bindings are lexically scoped, where scope is delimited by parentheses, brackets, braces, chevrons, indentation, and bullet list syntax. Hence
(!bind(test)(*test*)
test)

test
is rewritten into
*test*

test
Towards a more natural syntax for mathematical expressions

Bindings can be used to remove the remaining tildes in the Termcat code. Consider the following declarations:
!bind
- =
- ~=~
!bind
- ,
- ~,~
!bind
- subseteq
- ~⊆~
!bind
- *
- ~×~
!bind
- |
- ~~|~~
!bind
- \<
- ~<~
For now the idea is that these bindings have to be set in every Termcat document. I may add a default bindings table at a later point though. In any case, after the above bindings have been defined it should be possible to write the following code:
E = {<a , n , n'> subseteq I * N * N | Pa and n < n'}
That looks a lot more readable than the LaTeX code if you ask me! In fact, I think it's nicer than the syntax I originally envisioned.

One obvious further improvement might be to treat commas (and semicolons) as special by default. This would obviate the need to surround commas by whitespace. I will look into this too.