I’ve teased it long enough: today we’re going to look at the real meat of turning declarative, something-that-looks-like-this patterns into fast, running, imperative, if-this-then-that code. The absolute final generation of working Scheme code will have to wait until our thrilling conclusion in part 4, but by the end of this episode you’ll probably at least have an idea of what that will look like.
Today we’re in the (extensible-match decision-tree) library, which takes patacts (the AST representation of a pattern match clause introduced last time) and produces an optimized decision tree. It starts with a foreboding comment:
;; This is some of the hardest code I have ever had to write.
And, indeed, not since I was a tweenaged baby hacker struggling through an online tutorial on how to implement A* pathfinding for my game have I had so many false starts and wrong directions in implementing a known algorithm. But these were problems with working out how to translate the idea into code – the idea itself is pretty intuitive!
Motivating example
To see how we benefit from an optimizing decision tree generator, let’s look at an example not too dissimilar from the last one we saw in the previous part.
(define (uniq ls same?) ; remove adjacent equivalent elements from list
  (match ls
    ('()
     '())
    ((cons _ '())
     ls)
    ((cons* y x ls*)
     (if (same? y x)
         (uniq (cons x ls*) same?)
         (cons y (uniq (cons x ls*) same?))))))
For this input, a typical Lisp hacker’s first attempt at a pattern matcher implementation usually generates code that looks basically like this:
(define (uniq ls same?)
  (let ((subject ls))
    (define (clause-1)
      (if (equal? subject '())
          '()
          (clause-2)))
    (define (clause-2)
      (if (pair? subject)
          (let ((g0 (car subject)) (g1 (cdr subject)))
            (if (equal? g1 '())
                ls
                (clause-3)))
          (clause-3)))
    (define (clause-3)
      (if (pair? subject)
          (let ((g2 (car subject))
                (g3 (cdr subject)))
            (if (pair? g3)
                (let ((g4 (car g3)) (g5 (cdr g3)))
                  (let ((y g2) (x g4) (ls* g5))
                    (if (same? y x)
                        (uniq (cons x ls*) same?)
                        (cons y (uniq (cons x ls*) same?)))))
                (no-matching-clause)))
          (no-matching-clause)))
    (define (no-matching-clause)
      (assertion-violation 'match "no clause matches" subject))
    (clause-1)))
The idea is simple: compile each clause to a procedure, and tail-call the next clause’s procedure as a ‘failure continuation’ at any point where you notice that the pattern doesn’t match.
This is a completely reasonable implementation for very many, nay, probably even a majority of use cases. It’s how everyone learning to write a pattern matcher starts out, and indeed probably how every individual pattern match compiler starts out (including extensible-match!). The syntax-case pattern matcher implemented by psyntax basically works like this, because the absolute fastest matching time isn’t a priority for something that runs at expand time. Alex Shinn’s pattern matcher implementation works like this – anything more complicated would probably be hell to write in pure syntax-rules, and the sheer portability Shinn managed to achieve by writing only in syntax-rules has let his pattern matcher take over the world, inefficiencies and the occasional fudging on semantics be damned. (By contrast, SRFI 262’s dependence on identifier properties will likely hold it back from any equivalent level of world dominance for some time yet, until more Scheme implementers start to catch on to how nifty they are.)
Some of the inefficiencies in this particular code will be optimized away by the Scheme compiler. I mentioned in the first part of this series that redundant let bindings are easy to get rid of. Unfortunately, more consequential inefficiencies are harder to deal with. Most significantly, even after optimization, this code will call pair? and cdr twice every time it goes through the loop: once for the second clause, then again in the third clause. While Scheme compilers are usually clever enough to eliminate repeated type checks when they occur within a single nested tree of code, putting each check in its own little procedure messes this compiler optimization up. The exception is Guile, because Andy Wingo developed a very clever compiler pass which can recognize common prefixes across these internal procedure boundaries. Also, very simple uses – like the last example from last episode – can have the clauses inlined into one another and thus end up in a nice nested tree for the compiler to work with; but as a data point, clause-3 above is already complex enough (or complex enough and referenced from enough places) that Chez’s cp0 doesn’t bother.
This repeated calling and branching is theoretically unsatisfactory, but the real philosophical issue in these inefficiencies – the thing that really motivated me to want to do better – is specifically that this generated code is nothing like what a programmer would write in the absence of pattern matching. Nobody would ever write code which had those repeated pair? calls. The real point I want to make by spending many lines of code on trying to do something better is that pattern matching is a good idiom, and programmers should use it. To that end, pattern matching should be a usable tool even in the hottest of hot loops, even on the slowest of microcontrollers: nobody should ever feel like they have to switch back to writing out if expressions by hand for performance’s sake. The fact that pattern matching is a declarative idiom, and declarative programming has something of a reputation for being inefficient, doesn’t exactly help its reputation here.
There are mountains of research into efficient compilation of pattern matching to avoid generating silly code like this. In the usual framing, there are two approaches: the backtracking automaton and the decision tree. The backtracking automaton, as its name implies, will still sometimes run the same tests on the same subject more than once, but tries to do so less than the naïve approach shown above which re-examines everything from scratch in every clause. The decision tree, on the other hand, will never re-examine any of its inputs, but may have much larger (exponentially larger!) code size than the automaton.
In fact these two approaches aren’t so different in practice – once you run the right time-saving optimizations for generating an automaton, or the right space-saving optimizations for generating a decision tree, the results are pretty close to one another. They’ll be nearly identical on very, very many use cases; it’s in the comparatively rare cases where they fall down that they fall down in different ways.
In the Scheme and Scheme-adjacent world, Racket’s pattern matcher by Sam Tobin-Hochstadt uses an optimized backtracking automaton after the technique by Le Fessant and Maranget (2001); extensible-match uses a decision tree after the technique by Maranget (2008). Yep, same guy! Luc Maranget is actually far from the only compiler researcher to have written papers on optimizing both techniques; but he is the only one whose papers now seem to be regarded as the somewhat definitive introduction to the state of the art for each of the two techniques.
Generating a decision tree
Maranget’s 2008 paper is heavy on notation, but mostly explains it as it goes along. It’s pretty accessible as European academia goes: if you’re good at reading and understanding other people’s code, you should be able to understand Maranget’s paper. It’s worth a read (or take a look at a shorter explanation by Colin James or David Nolen) – I’ll explain it here as it’s implemented in extensible-match, noting some (but not all!) differences to Maranget’s original presentation.
So, our input is a list of patterns and their associated actions, one for each clause. In fact, our input is always a list of rows of patterns – remember, we’re in %core-match-lambda which matches multiple patterns against multiple values, using the row-pattern AST type to group them together. Each AST pattern (and subpattern, but we’ll get to that) represents something we can do to get closer to picking a clause which matches the subject – meaning that with the row-pattern, we have multiple potential patterns to choose from. The ?-pattern, for example, represents that we could test its subject against a predicate; the var-pattern represents renaming an internal subject variable to a pattern variable visibly bound in the clause.
Let’s say we pick one pattern from the first row, since that’s the one we have to test first. (How we pick one is a topic we’ll get to later.) We’re going to act on this pattern – let’s say a ?-pattern testing pair? against our input x. What do we need to do?
What we ultimately want to do is generate (if (pair? x) <code-if-a-pair> <code-if-not-a-pair>); this gives us a hint that we also need to work out how to fill in those two slots. Well, what goes in those two slots?
The <code-if-not-a-pair> clause is easier to understand first, because if (pair? x) is false, the whole row can’t match any more: a row is an and combination of patterns, and one of the anded subpatterns just threw up false. So we can throw out that entire row and fill in <code-if-not-a-pair> with the code generated by repeating this process on that smaller list of rows with their actions.
But wait! What if the next row also contains a ?-pattern testing pair? on x – there’s no point testing that again, because it will be false a second time as well! So apart from throwing away the first row, we also walk through the remaining rows and throw away any of them which also depend on the same test before we repeat this process to generate the <code-if-not-a-pair>.
The <code-if-a-pair> is generated by a similar walk over the rows and recursion on the result. We don’t throw away the first row because in this case, that part of it did match. Instead, we remove it from the row and from any subsequent rows which make exactly the same test, preventing generating the same test again in the case where we know it succeeded as well as the case where we know that it failed.
That’s basically the core of Maranget’s algorithm. Maranget calls generating the new rows for <code-if-a-pair> specializing the patterns, and generating the new rows for <code-if-not-a-pair> defaulting the patterns – both with respect to a particular ‘constructor’, but because Scheme is different (more below), extensible-match calls these specializers. There are two base cases of the recursion: one when there are no rows left (no patterns matched; generate the code to signal a matching failure) and one when the top row only consists of irrefutable patterns (ones like _ wildcard patterns and var-patterns, which don’t have failure cases: in this case that row matched; generate the code to activate the corresponding action).
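The two branches and the two base cases fit in remarkably few lines. Here’s a deliberately toy model – in Python rather than Scheme, so it can’t be mistaken for the real library – in which a row is just a list of tests (each a hypothetical (predicate, subject) pair standing in for a refutable pattern) plus an action, and the ‘pick’ step simply takes the leftmost test:

```python
def build(rows):
    """Build a decision tree from a list of (tests, action) rows.
    Each test is a (predicate, subject) tuple; a row matches when all
    of its tests succeed. Returns a tree of nested tuples."""
    if not rows:                  # base case 1: no rows left, match failure
        return ('fail',)
    tests, action = rows[0]
    if not tests:                 # base case 2: first row is irrefutable
        return ('leaf', action)
    t = tests[0]                  # pick a test from the first row
    # Specialize: t succeeded, so erase it from every row that makes it
    success = build([([u for u in ts if u != t], a) for ts, a in rows])
    # Default: t failed, so throw away every row that depends on it
    failure = build([(ts, a) for ts, a in rows if t not in ts])
    return ('test', t, success, failure)

# The three clauses of uniq, with the tests written out by hand:
rows = [([('null?', 'ls')], 'empty'),
        ([('pair?', 'ls'), ('null?', 'ls.cdr')], 'single'),
        ([('pair?', 'ls'), ('pair?', 'ls.cdr')], 'general')]
tree = build(rows)
```

Walking the resulting tree shows (pair? ls) tested exactly once, rather than once per clause as in the naïve expansion.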
However, there’s one type of pattern in our AST which doesn’t have a failure case, but hides more subpatterns which do: the apply-pattern which calls some procedure on its subject variable and creates more subject variable(s) for the value(s) returned. Those subject variables are then available to be the subject of a subpattern of the apply-pattern.
For his equivalent of an apply-pattern Maranget makes a whole separate set of patterns when specializing, just containing the subpatterns. This probably works better for his compiler of ML patterns than it did when I tried in my compiler of Scheme patterns: for one thing, testing type (our ?-pattern) and deconstructing into values for subpatterns (our apply-pattern) are the same thing in his implementation; since ML is statically typed and pattern matching is the primitive means of deconstruction, the compiler has a lot more knowledge about the potential range of input values and what is and isn’t possible after each deconstruction step than extensible-match can get as a Scheme macro. In particular, Maranget’s patterns after each deconstruction step always form a matrix with columns for each value and rows for each clause. That’s not guaranteed by the structure of this AST (but explains why the term ‘row’ is used in extensible-match).
So instead, when an apply-pattern has been chosen as the specializer, each specialized row simply has the subpattern of that apply-pattern spliced into it. This means that each row-pattern can have a different length, depending on which apply-patterns have been specialized and had their subpatterns spliced in!
As one last point, in order to simplify the job of the procedures which pick a specializer and specialize and default the rows on it, after this splicing takes place (or more accurately before each recursive step), any rows nested within the rows are flattened out. If there were any or-patterns, they’re also expanded into multiple copies of the entire patact, each with one of the or alternatives, so the decision tree generator proper doesn’t need to go looking for specializers inside of or-patterns either.
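That expansion step is easy to picture with another toy sketch in Python (a hypothetical representation I’ve made up for illustration, not the library’s own): scan a row’s pattern slots for an ('or', alternatives) marker, and copy the whole row once per alternative, recursing so that nested ors disappear too.

```python
def expand_ors(row):
    """Given (patterns, action), return a list of rows in which every
    ('or', alternatives) slot has been replaced, producing one copy of
    the row per alternative. Recursion handles nested or-patterns."""
    patterns, action = row
    for i, p in enumerate(patterns):
        if isinstance(p, tuple) and p and p[0] == 'or':
            return [expanded
                    for alt in p[1]
                    for expanded in expand_ors((patterns[:i] + [alt] + patterns[i+1:],
                                                action))]
    return [row]                  # no or-patterns left: the row stands alone

# A row matching either the empty list or a pair, then binding x:
row = ([('or', [('null?', 'v'), ('pair?', 'v')]), ('var', 'v', 'x')], 'act')
```

Downstream code then only ever sees plain rows, at the cost of duplicating the rest of the row – which is exactly why sharing of identical subtrees matters later on.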
Representing a decision tree
For reasons that will become more apparent below and in the next episode, a decision tree is actually another kind of intermediate representation: we don’t represent it directly as Scheme code. Here’s its definition, slightly simplified:
(define-record-type dt-node
  (fields success-branch failure-branch))
(define-record-type dt-test
  (fields proc var)
  (parent dt-node))
(define-record-type dt-apply
  (fields proc var vars)
  (parent dt-node))
(define-record-type dt-equal
  (fields val var)
  (parent dt-node))
(define-record-type dt-rename
  (fields internal external)
  (parent dt-node))
Immediately it’s clear how much simpler this is than the AST we got as input: only four concrete node types vs nine pattern types in the AST. The action type from the AST also appears in decision trees, as the leaf nodes, where a matching clause has been chosen or the match failed.
?-patterns, when chosen, become dt-test nodes; apply-patterns become dt-apply nodes; quote-patterns become dt-equal nodes; var-patterns become dt-rename nodes. and-patterns, or-patterns, row-patterns, and not-patterns are reflected in the structure of the tree rather than as nodes themselves. wildcard-patterns are no-ops, as far as the pattern match compiler is concerned, and don’t show up in the tree at all.
Step by step
Here’s a worked example. Let’s take the last example from the previous episode:
(define (last ls) ; returns the last element in a list
  (match ls
    ((cons x '()) x)
    ((cons _ ls*) (last ls*))))
Our AST for this pattern match expression looks like this (in an entirely hypothetical notation):
((patact (row (and (? #:subject ls #:predicate pair?)
                   (row (apply #:subject ls
                               #:procedure car
                               #:vars (ls.car)
                               #:subpattern (row (var #:subject ls.car
                                                      #:name x)))
                        (apply #:subject ls
                               #:procedure cdr
                               #:vars (ls.cdr)
                               #:subpattern (row (quote #:subject ls.cdr
                                                        #:datum ()))))))
         (action return x))
 (patact (row (and (? #:subject ls #:predicate pair?)
                   (row (apply #:subject ls
                               #:procedure car
                               #:vars (ls.car)
                               #:subpattern (row (wildcard #:subject ls.car)))
                        (apply #:subject ls
                               #:procedure cdr
                               #:vars (ls.cdr)
                               #:subpattern (row (var #:subject ls.cdr
                                                      #:name ls*))))))
         (action do-last ls*)))
We look at the first row, which only has one subpattern and thus only one potential specializer to pick: (? #:subject ls #:predicate pair?). So we generate a dt-test node which checks if ls is a pair.
In the success branch of this dt-test node, we put the result of re-running this algorithm on all of the patacts with this specializer applied. We don’t need the pair? any more in either row, so it will disappear and be replaced by the right-hand side of the and pattern which contains it. This in turn is a row in both cases; we splice it into the main row of each patact. Our specialized patacts now look like this:
((patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (var #:subject ls.car
                                            #:name x)))
              (apply #:subject ls
                     #:procedure cdr
                     #:vars (ls.cdr)
                     #:subpattern (row (quote #:subject ls.cdr
                                              #:datum ()))))
         (action return x))
 (patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (wildcard #:subject ls.car)))
              (apply #:subject ls
                     #:procedure cdr
                     #:vars (ls.cdr)
                     #:subpattern (row (var #:subject ls.cdr
                                            #:name ls*))))
         (action do-last ls*)))
As for the failure branch of dt-test: we know that the first patact can’t match any more just because this single test failed. We have to default on the second patact only – and that also required (pair? ls) to be true. So our defaulted patacts in this case are just the empty list, and the recursion on that will generate the leaf node to signal an exception because no clause matched that input.
Back on the success branch, we still have these two patacts, but this time the first row gives us a choice of what to do. We could pick the leftmost one – the car – but actually the car isn’t very interesting because it’s not refutable. The cdr, on the other hand, can fail, so let’s prioritize it. We generate a dt-apply binding the cdr of ls to ls.cdr; it only needs a success branch since we assume this can’t fail. So we specialize again, and the specialized rows, after splicing, look like this:
((patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (var #:subject ls.car
                                            #:name x)))
              (quote #:subject ls.cdr
                     #:datum ()))
         (action return x))
 (patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (wildcard #:subject ls.car)))
              (var #:subject ls.cdr
                   #:name ls*))
         (action do-last ls*)))
Still two patterns in the top row, and again, the quote pattern looking at ls.cdr is still considerably more interesting than the apply pattern, so we pick that as the specializer again. We generate a dt-equal node checking if ls.cdr is the empty list. This one does need success and failure branches.
On the success branch, the specialized patterns look like this:
((patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (var #:subject ls.car
                                            #:name x))))
         (action return x))
 (patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (wildcard #:subject ls.car)))
              (var #:subject ls.cdr
                   #:name ls*))
         (action do-last ls*)))
We have completely eliminated the cdr patterns from the top row! Only the irrefutable car pattern is left, and when the first row is irrefutable, the pattern has matched – so we generate the dt-apply node and its dt-rename child and have it point to the action which returns x. Done!
On the failure branch of the check for a null ls.cdr, we’ve thrown away the top row again and we end up with this single patact:
((patact (row (apply #:subject ls
                     #:procedure car
                     #:vars (ls.car)
                     #:subpattern (row (wildcard #:subject ls.car)))
              (var #:subject ls.cdr
                   #:name ls*))
         (action do-last ls*)))
Once again, this is irrefutable, so we again generate the dt-apply node for ls.car, the dt-rename node for the ls* variable, and then finish at the action.
Picking a pattern
Okay so, at each step of the pattern matching process, we pick a specializer from the first row’s patterns, generate a node from it, and specialize/default the remainder of the patterns on it. But how do we pick which of the row’s patterns to act on?
We have to work out which pattern is going to help us find answers the fastest, probably by comparing against the patterns in different rows somehow. We need a heuristic – a scoring function to guess which pattern is most likely to help find the matching clause in the fewest nodes possible.
Maranget considers the question of which heuristics to use, assigning letters to each of a number of potential scoring functions and measuring the impact of many potential combinations of them. In Maranget’s terms, extensible-match uses heuristic fnr: the specializer picked has to be in the first row; the specializer is the one which will specialize the most rows (most needed); if there’s a tie for the most needed specializer, we pick the pattern with the smallest combined size of the specialized and defaulted rows. (If even that goes to a tie, just pick the leftmost tied specializer.) I didn’t benchmark this particularly: n seemed like an intuitively sensible heuristic; based on a suggestion in Maranget’s paper, I later added r as an additional tie-breaker.
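To make fnr concrete, here’s a hedged sketch in Python with a deliberately simplified representation of my own devising (each row is a list of tests plus an action, and any test appearing in the first row is a candidate specializer) – the real library’s scoring works over its pattern AST, not anything this tidy:

```python
def pick_specializer(rows):
    """Pick a specializer from the first row's tests: heuristic f is
    baked in (only first-row tests are candidates), n prefers the test
    needed by the most rows, and r breaks ties by the smallest combined
    size of the specialized and defaulted rows. Python's min is stable,
    so a remaining tie goes to the leftmost candidate."""
    first_tests, _ = rows[0]
    def needed(t):
        # n: how many rows make this same test?
        return sum(1 for tests, _ in rows if t in tests)
    def combined_size(t):
        # r: total remaining tests across the success and failure branches
        specialized = [[u for u in tests if u != t] for tests, _ in rows]
        defaulted = [tests for tests, _ in rows if t not in tests]
        return sum(map(len, specialized)) + sum(map(len, defaulted))
    return min(first_tests, key=lambda t: (-needed(t), combined_size(t)))
```

Negating the n score inside the key tuple lets one stable min call express ‘most needed first, then smallest branches, then leftmost’.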
Some of the heuristics Maranget considers aren’t really applicable to Scheme. His ML version can generate nodes with more outward branches than just success and failure, because the ML compiler can generate a single switch over many types at once; Scheme detects different types by calling an individual predicate procedure for each individual type, so that’s not possible. Neither, for example, is heuristic b, which picks the smallest switch to build – so extensible-match also can’t use the combination of heuristics Maranget personally recommends, pba.
So why was it so hard/why didn’t you do it slightly differently
As always, the hard thing was not the actual writing of the code, but working out what the right code to write was.
For example, you might ask why I wasn’t tempted to skip the step of representing decision trees as their own, reified type and simply recurse by generating new match expressions nested within the branches of an if or within the body of a let for the specializer we’ve chosen.
This is, indeed, very tempting. If you’re implementing a compiler based on decision trees, you might even find some literature suggesting that you could do exactly this. Surely, you’ll think, just like those Swap nodes in Maranget’s paper aren’t actually part of how it’s really compiled – surely you can avoid generating the other nodes per se too, and just go straight to generating the code which those nodes will eventually be turned into?
So you try it, and it doesn’t work, and you regret it, and you commit it to the repository anyway as a warning to your future self and to others. Actually there were other things that were wrong with that first attempt – trying to insist that patterns form a matrix-like grid turned out to be the wrong thing for this kind of pattern input AST. My idea was that I really wanted to be able to property test the decision tree generator, ensuring that for any generated patterns and terms it always had the same result as the naïve style of generation shown above. I was thinking of decision tree generation as an optional optimization pass, rather than a fundamental transformation pass. This is the wrong mental model to have.
Anyway, the real reason that would have been the wrong approach even if it had worked is that it wouldn’t have allowed any optimization for code size or compilation speed. Wait whaaaaaat??
Optimizing space and time
Yeah, remember how I mentioned the potentially exponential code size if you choose decision trees, but only in cases where the tricks to avoid it start to fail? It’s time to talk about those tricks.
The most important trick is sharing of nodes. When two different branches in the tree end up doing the same things, they should actually re-converge and become the same branch again. This makes the decision tree actually a decision dag (directed acyclic graph).
To achieve this sharing of nodes between different code paths, extensible-match uses two tricks. The most important one is hash consing, a trick more generally applicable in functional programming. When you have immutable structures which might be created many times over with the same contents, you may be able to save memory by storing all instances of structures of that type in a hash table. When someone asks to create a new instance of the type, first check if there’s already one existing which has the contents they’re interested in. If so, just return that existing one instead of creating a new one. (But if not, be sure to put the newly-created object into the table before returning it, in case someone asks for another one of the same in future.)
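The general trick fits in a few lines. Here it is in Python (nothing to do with the R6RS contortions the library actually has to perform): an interning constructor keyed on a node’s contents, so that structurally equal nodes become one and the same object.

```python
_interned = {}

def intern_node(kind, *fields):
    """Hash-consing constructor: nodes with equal contents are shared.
    Children must themselves be interned, so equal subtrees are already
    identical objects and whole equal branches collapse into one."""
    key = (kind,) + fields
    node = _interned.get(key)
    if node is None:
        node = _interned[key] = key   # reuse the key tuple as the node
    return node

a = intern_node('test', 'pair?', 'x',
                intern_node('leaf', 1), intern_node('fail'))
b = intern_node('test', 'pair?', 'x',
                intern_node('leaf', 1), intern_node('fail'))
assert a is b   # not merely equal: the very same object
```

Because the children are interned bottom-up, two branches of the tree that end up doing the same things really do re-converge into the same node, turning the tree into a dag for free.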
Neither R6RS’s record system nor its hash tables were really meant to support doing this, and the (extensible-match decision-tree) library has to do some pretty undignified nonsense in order to make it work. But it does work, and saves both time and space: space obviously, but also time because it stresses the garbage collector less to have fewer objects sitting around.
There’s a second layer of caching which is similar, one I mentioned I was considering using in an earlier post: the recursive procedure which generates tree nodes from a (specialized or defaulted) set of rows and actions is memoized. This has a similar effect to hash consing but, since hash consing already guarantees sharing, it optimizes compilation time and not decision tree size. Memoization avoids repeating an exponential recursion entirely in many cases. I measured how many cache hits the memo table got for some complex patterns, and how much it sped things up, and it was totally worth it. The 14-case match over three list variables which used to take 9.3 seconds to compile on Guile now comes back almost immediately.
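Memoizing the generator is the same trick one level up. Here’s a hedged Python sketch (a toy stand-in, not the library’s code): the recursive build procedure keyed on its hashable input, with a call counter to show repeated subproblems being served from the cache instead of recomputed.

```python
calls = 0

def memoize(f):
    """Memoize a one-argument procedure on its (hashable) argument."""
    cache = {}
    def memoized(arg):
        if arg not in cache:
            cache[arg] = f(arg)
        return cache[arg]
    return memoized

@memoize
def build(rows):
    """Toy tree generator: rows is a tuple of (frozenset-of-tests,
    action) pairs, so it can serve as a dictionary key. The success
    and failure recursions frequently bottom out on the same smaller
    row tuples; that overlap is what the memo table exploits."""
    global calls
    calls += 1
    if not rows:
        return ('fail',)
    tests, action = rows[0]
    if not tests:
        return ('leaf', action)
    t = min(tests)                # deterministic pick for the sketch
    success = build(tuple((ts - {t}, a) for ts, a in rows))
    failure = build(tuple((ts, a) for ts, a in rows if t not in ts))
    return ('test', t, success, failure)

rows = ((frozenset({('p?', 'x'), ('q?', 'y')}), 1),
        (frozenset({('q?', 'y')}), 2))
tree = build(rows)
```

A pleasant side effect of memoizing on equal inputs is that equal subproblems come back as the very same node object, so in this sketch the sharing falls out of the memo table too.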
Visualization
What’s the most common target language for compilers? x86? The JVM? C? JavaScript? If I were a betting lass, I might be tempted to put money on the answer being DOT, the GraphViz file format. If you want to understand what your compiler is doing, the easiest way to see it is often to actually have it draw your control-flow graph, automaton, or decision dag in two dimensions. Outsourcing the actual drawing task to GraphViz is a nice sweet spot in the trade-off between getting it working and making it nice.
Initially I tried to see what was going on with a very silly (non-hygienic) code generator outputting Scheme code, but this quickly got tiring to read, especially because it couldn’t show sharing in the dag. (How the real code generator represents shared nodes is a topic for next time.) With a 2D graph, it’s relatively easy to get a rough impression on first glance of how well the decision tree generator handles certain cases; by following the arrows, you can fairly easily check for anything it’s doing wrong.
I’ve kept an archive of older DOT files showing how the pattern match compiler has got more capable over time, going back to some of the first decision trees this code ever generated, before it was even hooked up to a code generator. (This DOT file is dated 2 December 2024, and I didn’t get the code generator working until Candlemas.) This lets me see how certain changes improved or disimproved the generated trees over time.
Originally, to review these DOT files, I had to manually copy the generated DOT source into a file and invoke the dot command line tool. But recently I thought, hey, I have a terminal emulator which can display inline graphics, so why not just show them directly in the REPL? So I wrote a little library to do just that.
Source code of the library which plops a drawing of a graph straight into my terminal
(library (extensible-match dot)
  (export show-dot)
  (import (rnrs (6))
          (only (guile) mkstemp port-filename waitpid)
          (srfi :39)
          (ice-9 popen))

  (define-syntax show-dot
    (syntax-rules ()
      ((_ expr)
       (let* ((dot-port (mkstemp "/tmp/dot-XXXXXX"))
              (dot-file (port-filename dot-port)))
         ;; It was probably a mistake that decision-tree->dot writes
         ;; to current-output-port, but I make the best of it:
         (parameterize ((current-output-port dot-port))
           expr)
         (close-port dot-port)
         (let-values (((from to pids)
                       (pipeline
                        `(("dot" "-Tpng" ,dot-file)
                          ;; I tried out both sixel with img2sixel
                          ;; (more widely supported) and the inline
                          ;; bitmap functionality of iTerm2 (my
                          ;; terminal emulator of choice) with this
                          ;; imgcat program. Sixel was very fragile
                          ;; when the terminal window got resized (the
                          ;; bitmap area would just unrecoverably
                          ;; blank out sometimes), so I went with
                          ;; imgcat. It’s for my own use only, anyway,
                          ;; and not too hard to change if needed:
                          ("imgcat" "--iterm2")))))
           (close-port to)
           (let ((stdout (standard-output-port)))
             (put-bytevector stdout (get-bytevector-all from))
             (flush-output-port stdout))
           (close-port from)
           (for-each waitpid pids)))))))
Enough of the talking, what do these decision trees actually look like, then? Here’s the drawing for the last example we worked through above.
The shapes of the nodes are the classic flowchart style: diamonds for decisions, rectangles for steps, and circles for termination.
Note that the do-last path also generates a call to (car ls) even though it doesn’t use the value of the car. In theory, it should be smart enough not to do this, but I decided for now to leave that optimization to the Scheme compiler on the theory that it can probably do a better job of it anyway.
For another simple example, this is a type-safe list merge.
(define list-merge
  (match-lambda
    (('() '()) '())
    (((? pair? xs) '()) xs)
    (('() (? pair? ys)) ys)
    (((cons x xs*) (cons y ys*))
     (if (< y x)
         (cons y (list-merge (cons x xs*) ys*))
         (cons x (list-merge xs* (cons y ys*)))))))
You can see the distinction between subject variables (arg1, arg2, arg1_car, arg1_cdr, arg2_car, arg2_cdr) and pattern variables. You can follow through and see how it avoids repeating the pair? check on both arguments for the final clause – there are two different pair? checks on arg2, but they’re in mutually exclusive code paths. There’s a long chain of dt-apply (receive) and dt-rename (let) nodes at the end, because the decision tree generator puts off those steps, which don’t affect whether the patterns match or not, until the end.
For a more complex example, let’s look at Chris Okasaki’s red-black tree balancing pattern (paper, book). This is probably pattern match compiler authors’ favourite example to use, because
it’s nontrivial
it’s real and not contrived
it’s a ‘hot’ function which gets called log₂ n times for every insertion into a tree of size n, making it important to optimize well
it’s a pretty amazing example of using pattern matching to translate reasoning about data structure invariants directly into running code, in a way that would be difficult without pattern matching
What does it do? Basically a red-black tree stays relatively well balanced by colouring each node red or black, and maintaining two properties about where coloured nodes are allowed to appear relative to one another. One of these rules is that no red node is allowed to touch another red node: black touching black is okay; red touching red isn’t. So Okasaki works out that after finding a place to insert a new red node, there are exactly four ways in which a tree can end up ‘unbalanced’ in having two adjacent red nodes; in that case the tree needs to be ‘rotated’ a little to restore balance.
He writes that like this (but in ML):
(define (balance n)
  (match n
    ((or (node 'black (node 'red (node 'red a x b) y c) z d)
         (node 'black (node 'red a x (node 'red b y c)) z d)
         (node 'black a x (node 'red (node 'red b y c) z d))
         (node 'black a x (node 'red b y (node 'red c z d))))
     (node 'red (node 'black a x b) y (node 'black c z d)))
    (_ n)))
a, b, c, and d are child subtrees; x, y, and z are values in the tree. The sort order of the values, including everything inside the child subtrees, is a < x < b < y < c < z < d. Visually explained, the job of this procedure is to turn one of these wrong trees:
Into this right tree:
Here’s what the decision dag looks like:
You can see that nodes are shared (1, 2, 3, 4) to reduce the overall size of the graph. This may look big and complicated, but at run time it’s efficient because no test or extraction will ever be repeated on any given code path. There are 13 branch points, the same number as in Colin James’s independent OCaml implementation of Maranget’s algorithm, a nice confirmation that my implementation is doing the right thing :-)
Can you imagine trying to write out the code for this pattern match by hand? You don’t have to imagine, because Rich Hickey implemented this data structure in bad old Java as part of Clojure. His version is spread all over the place because old-school Java just lacked the features to make writing this function nice. (I’m not even sure new-school Java could do much better.)
For an example where things start to fall down in Scheme, consider Maranget’s bytecode interpreter for a miniature dialect of ML. In figures 6 and 7 of his 2008 paper – which really ought to be shown as motivating examples on the first page – Maranget showed how a bad and a good decision tree for this can look in OCaml, a statically-typed language.
First of all, here’s roughly what the run function would look like in Scheme:
(define (run a s e c)
(match (values a s c)
((_ _ (cons (ldi i) c)) (case-1))
((_ _ (cons (push) c)) (case-2))
(((? integer? n2) (val (? integer? n1)) (cons (iop op) c)) (case-3))
((0 _ (cons (test c2 _) c)) (case-4))
(((? integer?) _ (cons (test _ c3) c)) (case-5))
((_ _ (cons (extend) c)) (case-6))
((_ _ (cons (search k) c)) (case-7))
((_ _ (cons (pushenv) c)) (case-8))
((_ (cons (env e) s) (cons (popenv) c)) (case-9))
((_ _ (cons (mkclos cc) c)) (case-10))
((_ _ (cons (mkclosrec cc) c)) (case-11))
(((clo cc ce) (cons (val v) s) (cons (apply) c)) (case-12))
((a (cons (code c) (cons (env e) s)) '()) (case-13))
((a '() '()) (case-14))))
You can already see that we’re running up against the limits of what it’s sensible to do in Lisp notation here.
What does it look like when compiled? I shared what it looked like in the very first working versions of the decision tree generator above … and today it looks like this. To be sure, there’s improvement there (196 nodes in the December 2024 version, 135 nodes today). But it’s also very far from the nice, elegant dag of just a dozen or so branches which Maranget shows in his paper, even accounting for the fact that his version can switch over many types at once and mine can’t. What gives?
Well, here dynamic typing spoils opportunities to do better. If you trace down the graph from the top, it looks good until you get to the third case. Once the pattern matcher knows that the current instruction (the first element of the list c, for code) is an iop but the top of the stack s isn’t a val, it should jump straight to fail, because there is no situation in which any of the other patterns can match. But it doesn’t know enough to know that: what if something might match both iop and test, or pushenv? A lot of these nodes are therefore dead, and will never execute except to go a long, slow way to a case that should never happen when this function is actually running anyway. I’ve been meaning to study in detail exactly which branches are impossible, and to see whether it would be possible to do anything better here, such as letting users declare which predicates in ? are mutually exclusive. (Initial results on that particular feature weren’t exactly encouraging either. Update, 30 November: Scratch that! I just tried it again and it works great: 86 nodes, down from 135. I must have forgotten to clear the cache of pre-compiled code before the last time I tested it, or something. The problem now is that a complete implementation of this would require implementing something like Datalog …) The generator does do some very basic things to avoid generating redundant nodes like these, but not nearly as much as an ML compiler can.
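To see why exclusivity declarations help, here’s a toy Python sketch (invented names and representation, not extensible-match’s internals): once one predicate is known to hold on a column, every clause demanding a predicate declared mutually exclusive with it is dead and can be dropped before any decision-tree node is generated for it.

```python
# Toy sketch: if two predicates are declared mutually exclusive, then once
# one of them is known to hold on a column, any clause demanding the other
# there can never match and can be pruned.
EXCLUSIVE = {frozenset({"iop?", "test?"}),
             frozenset({"iop?", "pushenv?"}),
             frozenset({"test?", "pushenv?"})}

def prune(clauses, column, known):
    """Keep only clauses whose requirement on `column` can still hold,
    given that `known` has already succeeded there (None = wildcard)."""
    def compatible(req):
        return (req is None or req == known
                or frozenset({req, known}) not in EXCLUSIVE)
    return [clause for clause in clauses if compatible(clause[column])]

clauses = [{"instr": "iop?"}, {"instr": "test?"}, {"instr": "pushenv?"}]
```

Without the declarations, a predicate the compiler knows nothing about keeps every clause alive, which is exactly how the dead nodes above come into being.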
As the folklore predicts, Racket’s backtracking automaton does better here for code size but worse for speed, generating many checks that c is a pair and many separate extractions of its car and cdr. It only does these things once for cases 1 and 2, but again for case 3, again for case 4, again for case 5, and only gets back on a roll again with cases 6, 7, and 8 …
Is all of this worth it?
Last of all, here’s the hand-wringing section where I ask whether I should have bothered with any of this!
In 2000 Kevin Scott and Norman Ramsey wrote a draft paper asking ‘When Do Match-Compilation Heuristics Matter?’; their answer is ‘almost never’. Out of 18 program benchmarks, only five showed any difference in decision tree size under different heuristics, and three of those got worse when choosing heuristics other than simple left-to-right selection of patterns. More encouragingly, 13 of the 18 benchmarks were artificial benchmark programs, while one of the five where heuristics did make some difference was a real-world program (the SML/NJ compiler).
Maranget’s results measuring heuristics may be more representative because he created a corpus of match expressions to test on specifically designed to represent a wide range of sizes and matching techniques; and because he actually used his own decision tree generation algorithm, the one extensible-match’s is based on. He finds a larger difference than Scott and Ramsey – about 20% improvement in code size and 10% improvement in average path length for any individual heuristic, compared to just naïvely picking the leftmost pattern in each row every time. As a reflection on how Maranget’s corpus might not be exactly representative, though, the heuristics only score about 15% better for code size and about 3% better for average path length than naïvely picking the rightmost pattern in each row.
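For a flavour of what such a heuristic actually computes: it scores the columns of the pattern matrix and picks the best one to test first. Here’s a toy Python sketch of one of the simplest flavours, roughly the longest-prefix idea (the matrix encoding, with "_" for a wildcard, is invented for illustration):

```python
# Score a column by how many consecutive leading rows put a real
# constructor there rather than a wildcard; test the best column first.
def prefix_score(matrix, col):
    score = 0
    for row in matrix:
        if row[col] == "_":
            break
        score += 1
    return score

def pick_column(matrix):
    return max(range(len(matrix[0])), key=lambda c: prefix_score(matrix, c))

# Column 0 starts with a wildcard, so column 1 wins:
matrix = [["_", "cons"],
          ["_", "null"],
          ["0", "cons"]]
```

The intuition: a column whose leading rows all constrain the subject lets one test discriminate between many clauses at once, instead of wasting a branch on a column most clauses don’t care about.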
Still, these are fairly slim pickings. As mentioned above, Guile’s optimizer is already capable of taking a naïve pattern match compilation and doing optimizations equivalent to generating a backtracking automaton that always picks the leftmost pattern for each operation; on the bytecode machine example it would probably do about as well as Racket’s match compiler, even without any help from the macro. The benchmarks by Scott and Ramsey suggest this should be good enough for almost every real use case. If Guile ever ships SRFI 262 in the standard distribution, it might well not be worth the extra weight in code to include a heuristic-driven decision tree generator; maybe that version should go back to just generating the naïve kind of compilation shown in the very first example above, and let Guile’s own optimizer do the work. Changing strategy to the automaton style in that case also seems like the right thing, given that any version shipped with Guile would probably also end up in Hoot, which has its own reasons for optimizing code size. For other Scheme implementations which don’t (yet) do the online common subexpression elimination pass Andy Wingo developed, the winnings will definitely be bigger – although still diminishing if online CSE becomes more common.
Next time
Whew, this was a long one. Next time will be a comparatively short conclusion, on the code generator.