Sunday, May 18, 2014

Comprehend: A Clojure Library for Pattern Matching on Sets

I've been doing a lot of Objective-C programming lately. Objective-C is not the worst language, but last week I couldn't help but become nostalgic of the Clojure programming I did at Hacker School. At some point I could no longer help but open Good Ol' Sublime Text and experiment with core.logic.

My procrastinating was productive because I now have the pleasure of presenting Comprehend to you.

Comprehend is a Clojure library containing a setlike data type that is optimized for the idiom { ∀ patterns ⊆ S : expr }. In Clojure this ends up looking like this:
(c/comprehend S pattern1 pattern2 ... expr) ; => '(result1 result2 result3 ...)
I explain the syntax in quite a bit of detail on the Github page (repeat link). In this blog post I want to go into some technical aspects of the implementation.

The main ideas

In Comprehend core.logic.pldb is used to manage the indexes of indexed sets.

When adding an element X to an indexed set, here is what we do:
  • Add something akin to [:root-element (hash X)] to the logic database
Next, with Y = X initially, we recursively perform the following steps:
  • If Y is sequential then for every element Z of Y at position i:
    • Add [:vector-length (count Y)] to the logic database
    • Add [:vector-element (hash Y) i (hash Z)]
    • If Z is a (non-opaque) collection then repeat this algorithm with Y := Z
  • If Y is a map then we do something similar for both the keys and the values of Y. We don't store the size of Y, however, because in the case of maps we also want to match on supermaps.
  • If Y is a different kind of collection (e.g. a  set) then we do the same but we don't have to deal with both keys and values.
That's it! core.logic.pldb puts indices on all the elements in our tuples and this helps us perform fast queries. Next to the index we also store a dictionary from all the hashes in our tuples to the original objects.

So about those queries. We follow a similar procedure to translate patterns into tuples. Here are the main differences:
  • At compile time we determine what the unbound symbols in the pattern are.
  • Wherever we would previously put (hash Y) or (hash Z) in a tuple, we now need to be more clever:
    • If Y (or Z) was one of the unbound symbols then we just return it. (Nitpick: This is at runtime and core.logic will have made the symbol resolve to a core.logic 'lvar'.)
    • If Y (or Z) is a collection that contains one of the unbound symbols then we replace it by an lvar. We reuse the same lvar for identical patterns.
    • In all other cases we store the hash value after all.
We then hand core.logic all of the following: (a) the core.logic.pldb index, (b) the list of unbound symbols in the pattern, and (c) the tuples-with-variables. Core.logic will do its thing and returns a sequence of sequences of hashes. The outermost sequence can be thought of as a list of matches. Every innermost sequence contains hashes of assignments for (b). We map these hashes onto assignments using the dictionary that we stored next to the index.

Recall that we determined (b) at compile time. This means that at compile time we can also generate the following template:
(for [[var1 var2 ...] (do-all-of-the-above)] expr)
Thus, insofar the only unbound symbols in expr also occurred in (one of) the patterns, expr will be grounded.

What is it good for?

I wrote a quick and dirty benchmark. The idea being merely to verify that my library is faster than naive approaches for some problems.

In the benchmark I create a graph of N vertices and N random edges. I then compare several approaches to find all paths that consist of four edges.

The first approach I tried, of course, uses Comprehend.

I also implemented an alternative core.logic query that uses a vector representation of the graph as its input source. This code uses core.logic's membero operator and also core.logic's built-in support for unification of vectors with logic variables. At around N=50 Comprehend conclusively overtakes this implementation.

I also implemented a naive algorithm that uses 'for', 'filter', and argument destructuring. At N=50 this implementation is about 50% slower than the previous core.logic query.

That is to say, there is some proof that there exist use cases for Comprehend. The exact scope of use cases is not clear to me yet, however. This warrants further investigation.

Practical uses aside, I had a ton of fun implementing this library and definitely also learned a thing or two along the way.

Alright, now it's back to Objective-C for me!