
If you go poking around the source code for extensible-match, you’ll find places where I define auxiliary syntax keywords and use syntax objects as if they were records belonging to some sum type. Like here or here or here.
Extensible-match belongs to the comparatively rare class of Scheme macros that are complex enough to count as compilers in themselves. As Olin says when defending the use of records inside his loop macro implementation, once you recognize that your macro has reached that level of complexity, it’s simply a matter of good programming practice to adjust your data representation and use records instead of trying to make ad hoc transformations on S-expressions. This holds even for things that are unambiguously compilers in their own right: the Nanopass framework helped compiler courses at Indiana, and then Chez Scheme itself, make the leap from trying to do everything with S-expressions to using proper records to represent and transform the input code. So why does extensible-match use these weird syntax object pseudo-records?
Well, there’s a story about that, and I consider myself to have learned my lesson here: eventually these things will probably mostly disappear.
Since extensible-match is a compiler, I thought when writing it that I would benefit from being able to use a pattern matcher – but I also didn’t want to get into the mess that psyntax has, where you have to use a pre-expanded version of the pattern matcher in order to be able to work on a new version. So I messily decided to abuse syntax objects as records so that I could consistently use syntax-case
as the pattern matcher used to implement another, more general pattern matcher.
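Concretely, the idiom looks something like this – all the names here are invented for illustration, not extensible-match’s actual internals:

```scheme
(import (rnrs))

;; Auxiliary keywords standing in for the ‘constructors’ of the sum type.
;; (These names are made up; the real code is structured differently.)
(define-syntax literal-pattern
  (lambda (stx) (syntax-violation 'literal-pattern "invalid use of auxiliary keyword" stx)))
(define-syntax variable-pattern
  (lambda (stx) (syntax-violation 'variable-pattern "invalid use of auxiliary keyword" stx)))

;; A ‘record’ is just a syntax object whose car is one of those keywords …
(define (make-literal-pattern datum-stx)
  #`(literal-pattern #,datum-stx))

;; … and syntax-case, with the keywords in its literals list, plays the
;; role of the case analysis over the sum type.
(define (pattern-kind pat)
  (syntax-case pat (literal-pattern variable-pattern)
    ((literal-pattern datum) 'literal)
    ((variable-pattern name) 'variable)
    (_ 'other)))
```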
This came back to bite me quite badly.
The algorithm extensible-match
uses to generate an optimal set of conditionals from patterns is Luc Maranget’s decision tree algorithm. The first problem that started to appear with using syntax objects here is that they don’t, in general, have a well-defined equivalence relation. For example, in order to reduce the number of cases the main decision tree generator has to deal with, there are some ‘source-level transformations’ which simplify some complex cases in the patterns themselves before generating the decision tree. One source-level transformation is rewriting disjunctive (or
) subpatterns as separate top-level patterns. Another handles the case that a ‘row’ – essentially an and
pattern where the order of checking subpatterns isn’t fixed – contains another row, by flattening the inner row into the outer row. But what if an or
contains a row, or vice versa? The easiest thing to do is to iterate these source-level transformations to a fixed point: repeat the process, each time feeding last time’s output back in, until the output is the same as the input, at which point we know there’s nothing more that needs to be done. But you need an equivalence relation to be able to test if the output was the same as the input! So I had a pretty awful procedure that tried to tell if the transformation procedure would find any transformations to do on its input, which in any other circumstance would be the worst kind of premature optimization.
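The fixed-point driver itself is the trivial part; here’s a sketch of it (a simplification, not the real code), where step is one pass of the source-level transformations and same? is the equivalence relation that syntax objects make so awkward to define:

```scheme
;; A sketch of the fixed-point driver: STEP does one pass of or-lifting
;; and row-flattening, SAME? decides whether anything actually changed.
(define (iterate-to-fixed-point step same? patterns)
  (let loop ((patterns patterns))
    (let ((patterns* (step patterns)))
      (if (same? patterns patterns*)
          patterns*              ; nothing changed: we're done
          (loop patterns*)))))   ; otherwise feed the output back in
```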
But the real problem came with the performance of Maranget’s algorithm. The algorithm has basically the same run-time complexity as the naïve recursive Fibonacci function: each step of generating the decision tree recurses on a set of patterns which is a little bit smaller than the input set, and on a set that’s a little bit smaller than that again. The hardest torture test I had when implementing Maranget’s algorithm was a port of the bytecode interpreter which is Example 3 in his paper. On Chez Scheme on a pretty recent MacBook, it took 8 seconds to generate the decision tree when I represented patterns this way. This is just 14 cases; Chez is one of the best-optimizing compilers around, and on most other implementations it would be even slower. 8 seconds for a moderately complex match
is already unacceptable for incremental development – and once extensible-match starts supporting more Scheme implementations, for many people it would be even worse, since probably no other implementation would be as fast as Chez at this, and many people have slower computers than mine.
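For a feel of what that recursion shape costs, here’s a toy model – emphatically not Maranget’s algorithm itself, just its recurrence, assuming each step recurses once on a pattern set about one case smaller and once on a set about two cases smaller:

```scheme
;; A toy model of the cost only: count the nodes in a call tree where
;; each non-trivial step recurses on inputs of size n-1 and n-2.
(define (tree-size cases)
  (if (<= cases 1)
      1
      (+ 1 (tree-size (- cases 1)) (tree-size (- cases 2)))))

(tree-size 14)   ; => 1219 — the work blows up quickly even at 14 cases
```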
Finding the cause of this performance problem was a nightmare. It was a classic ‘peanut butter’ performance problem: because it was smeared out over each recursive step of the main algorithm, there was no obvious individual hot spot that was making it slow. Since I’ve already spoiled half the solution with the premise of this post, I’ll cut to the chase and spoil the rest: in order to distinguish which case of the (conceptual) sum type syntax-case
is looking at, it uses the free-identifier=?
comparison procedure. This procedure is actually quite slow because it has to check all the local bindings in scope where its inputs come from; since it’s rarely used intensively – usually only a few times per macro expansion, if even that – you don’t notice how slow it is until it starts to add up from being called again and again in the middle of an exponentially-recursive algorithm.
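To make the hidden cost visible, here are the same checks written two ways, reusing the hypothetical literal-pattern keyword from the sketch above: the first leans on the literals list, the second spells out the free-identifier=? call that syntax-case makes on its behalf – once per literal, on every invocation.

```scheme
;; Relies on the literals list; the comparison is implicit.
(define (literal? pat)
  (syntax-case pat (literal-pattern)
    ((literal-pattern datum) #t)
    (_ #f)))

;; The same check with the free-identifier=? call written out by hand.
(define (literal?/explicit pat)
  (syntax-case pat ()
    ((head datum)
     (and (identifier? #'head)
          (free-identifier=? #'head #'literal-pattern)))
    (_ #f)))
```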
Ending appropriately spoiled, here’s the rest of the sorry story of me trying to work out what was wrong.
I initially thought it was because the hash consing of nodes wasn’t efficient enough (Chez’s time
syntax told me that a lot of GCing was happening) – so I tried to optimize that, but it made no real difference. For all its strengths, Chez Scheme’s options for debugging and profiling are pretty weak: you can ask it to annotate code for profiling when loaded, but the resulting profile will only count how many times each bit of code was evaluated, not how much time the evaluation took. It also won’t tell you how many times procedures hidden behind macro expansions are called (like free-identifier=?
in syntax-case
); nor will it go inside any code it hasn’t annotated or can’t annotate, like the procedures within the Chez Scheme standard library (like, again, free-identifier=?
). Eventually it occurred to me that, while the whole extensible-match
package wouldn’t load on Guile because Guile lacks identifier property support, the decision tree generator didn’t need that support, so I could at least use Guile’s far superior profiling options (a real statistical profiler! one that also looks inside the code you depend on and not only your own code!) to diagnose the problem.
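The Guile run looked roughly like this; statprof is Guile’s statistical profiler, while generate-decision-tree and bytecode-interpreter-patterns are placeholder names for the decision tree generator and the 14-case torture test, not real exports:

```scheme
;; Sketch of the profiling run; the two names passed to the thunk are
;; placeholders for my decision tree generator and its test input.
(use-modules (statprof))

(statprof
 (lambda ()
   (generate-decision-tree bytecode-interpreter-patterns)))
```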
That only confirmed my suspicions that 8 seconds in Chez would translate into something much, much worse on other implementations: the same 14 cases of Maranget’s bytecode interpreter took over 5 minutes to generate a decision tree on Guile! What’s more, although the profiler correctly identified free-identifier=?
as the culprit, with 70% of execution time spent there, I couldn’t work out why. You see, the code does also explicitly use free-identifier=?
to determine which predicate and accessor calls it can coalesce into a single node in the output decision tree. But I couldn’t believe that was the expensive part in this case!
I can’t remember what made me realize that it might be the free-identifier=? internal to the syntax-case form that discriminated between the different types of patterns that was the issue, but since free-identifier=?
seemed to be the culprit, I started to work on getting it out of the main decision tree generation procedure entirely: not only the explicit uses, but the implicit use in syntax-case
as well. I had a very painful refactoring session adding a new layer to the pattern representation, converting it from syntax objects to real Scheme records. The free-identifier=? calls during decision tree generation were replaced by interning patterns’ embedded expressions into a table, so that instances of the same identifier, which could be coalesced into a single node in the output decision tree, could be checked by a simple integer comparison instead of free-identifier=?
’s recursive scope analysis. (Building this table has quadratic complexity since Scheme has no free-identifier-hash
, but it should still save time.)
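The interning step, under made-up names, is nothing more exotic than this sketch: each distinct identifier (distinct up to free-identifier=?) gets a small integer, and the linear scan over everything seen so far is where the quadratic cost comes from.

```scheme
(import (rnrs))   ; for find and free-identifier=?

;; Returns a procedure mapping identifiers to small integers; identifiers
;; that are free-identifier=? get the same integer.  The lookup is a
;; linear scan, there being no free-identifier-hash to key a hashtable on.
(define (make-interner)
  (let ((seen '())    ; alist of (identifier . index)
        (next 0))
    (lambda (id)
      (cond
        ((find (lambda (entry) (free-identifier=? (car entry) id)) seen)
         => cdr)
        (else
         (let ((index next))
           (set! seen (cons (cons id index) seen))
           (set! next (+ next 1))
           index))))))
```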
This refactoring meant I could no longer use the syntax-case
pattern matcher to implement the decision tree generator – but there was one nice side-effect, because I could use the opportunity to make explicit one bit of state that was only implicit in the old representation: the name of the variable which has the actual value each subpattern is being tested against.
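In the spirit of the new representation (reusing the invented names from the earlier sketch; the real type and field names differ), each pattern record now carries that variable as an explicit field:

```scheme
(import (rnrs))

;; Illustrative record types only, not extensible-match's actual ones.
(define-record-type literal-pattern
  (fields subject    ; identifier: the variable holding the value under test
          datum))    ; the literal it is compared against

(define-record-type variable-pattern
  (fields subject
          name))     ; the pattern variable this case binds
```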
Whoosh! That alone solved the performance problem: from 8 seconds, the decision tree generation for Maranget’s interpreter was down to 0.2 seconds in Chez. On Guile it’s still 9.3 seconds – although this can’t fairly be compared with the 314 seconds it took there before, as I now have a newer and slightly faster computer than when I originally ran that benchmark, and (for whatever reason) I didn’t repeat this performance test on Guile on the old computer once I’d solved the performance issue.
Ten seconds for 14 cases still isn’t great, but I’m now much less concerned about this. If you recall the similarity to a Fibonacci recursion mentioned above, it might seem that the solution is memoization. In theory, memoizing the recursion over the smaller of the two cases – the one which excludes the topmost row, and which generates the side of the tree used when a test required by that row failed – might help. I haven’t tried this yet, though: based on anecdotal evidence, 14 cases is already larger than the vast majority of pattern matches out there, and 0.2 seconds for this on the main implementation extensible-match supports is pretty okay.
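If I do try it, the memoization itself would presumably be nothing fancier than a generic wrapper like this, keyed on whatever canonical, comparable form the interned pattern matrix takes:

```scheme
(import (rnrs))

;; Generic memoizer sketch: cache results of F keyed on its (comparable)
;; argument, so repeated sub-matrices are only compiled once.
(define (memoize f)
  (let ((cache (make-hashtable equal-hash equal?)))
    (lambda (x)
      (if (hashtable-contains? cache x)
          (hashtable-ref cache x #f)
          (let ((result (f x)))
            (hashtable-set! cache x result)
            result)))))
```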
If I do go in and start looking for expand-time performance improvements again, another option for faster compilation might be to just split the input pattern list into chunks (sketched below), and give up on optimizing the decision tree between the chunks. Since the performance characteristic is exponential and k^n + k^m is usually much smaller than k^(n+m), this indeed seems to work: splitting these 14 cases up into 10 + 4 cases gets a decision tree in under a second on Guile – but this feels like cheating. Maybe with a compile-time option to set the chunk size or disable the cheat (so production builds get fully-optimized, fast-running decision trees while development builds get fast compilation but slower matching) … I’ll probably wait until match compilation speed on Guile isn’t quite such a shitshow for other, unrelated reasons before starting to really consider these options. (For one thing, how do or
patterns count here? As mentioned above, they hide multiple cases, potentially buried deep inside a nested subpattern, and have to be unfolded to the top level to get a true count of the number of cases the decision tree generator will see.)
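Back to the cheat itself: in outline it would look something like this (names invented) – split the clauses into fixed-size chunks, compile each chunk to its own decision tree, and chain the trees so that falling off the end of one chunk falls through into the next. Only the splitting step is sketched here.

```scheme
;; Split CLAUSES into chunks of at most SIZE clauses each; each chunk
;; would then be compiled to its own decision tree.
;; e.g. 14 clauses with size 10 -> a chunk of 10 and a chunk of 4.
(define (split-into-chunks size clauses)
  (define (take lst n)
    (if (or (zero? n) (null? lst))
        '()
        (cons (car lst) (take (cdr lst) (- n 1)))))
  (define (drop lst n)
    (if (or (zero? n) (null? lst))
        lst
        (drop (cdr lst) (- n 1))))
  (if (<= (length clauses) size)
      (list clauses)
      (cons (take clauses size)
            (split-into-chunks size (drop clauses size)))))
```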
Anyway, the point here is that the main representation of patterns is now in real Scheme records, as God intended, but there are other places in the code that still abuse syntax objects as records. Those don’t hit any performance-sensitive code, though. Probably it would be better to refactor them out as well, some day: you might notice that this isn’t the only place where records which start out as syntax objects get turned into another layer using real records (oops, bad idiom for sum types there – again, I wanted to be able to use case
and not a load of type test predicates in a cond
, which was really very petty of me :-( ). But that concludes today’s missive. Moral of the story: use real records in your macros when you need to; don’t try and fake them with syntax objects because you think syntax-case
will be a nice pattern matcher to implement your compiler with. If you’re implementing anything other than another pattern matcher, you should have no reason to even think of abusing syntax-case
for this in the first place – just use a real pattern matcher like extensible-match inside of your macro definition!
Fun fact: Although this is the first post on this new blog, it’s actually a footnote to another blog post I started to write. The footnote got … rather out of hand, and ended up being a post of its own.