bouncer

ClojureTV · 309 views · 17 likes

Analysis Summary

10% Minimal Influence

“This is a highly transparent academic and technical talk; be aware that the 'problem' of non-uniform sampling is specific to complex data structures and may not apply to simpler testing scenarios.”

Transparency: Transparent
Human Detected: 100%

Signals

The video is a recording of a live technical presentation at a known industry conference, featuring natural human speech patterns, disfluencies, and specific academic context that AI cannot currently replicate in a live setting.

Natural Speech Disfluencies: Transcript contains frequent filler words ('um', 'uh'), self-corrections ('to to'), and natural hesitations typical of live academic presentations.
Contextual Provenance: Recorded at a physical conference (Clojure/Conj 2025) with a specific speaker biography and live audience interaction mentioned in the intro.
Complex Narrative Structure: The speaker uses personal anecdotes about their research process and specific technical challenges that deviate from formulaic AI scripts.

Worth Noting

Positive elements

  • This video provides a deep dive into the intersection of formal language theory and practical software verification, specifically how to generate non-trivial regular expressions.

Influence Dimensions

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed March 13, 2026 at 16:07 UTC · Model: google/gemini-3-flash-preview-20251217
Transcript

Okay, thanks everyone for coming to to take a look at my talk. Um, and especially thanks to the organizers for selecting this uh this presentation. It's somewhat uh academic, which looks like it's a bit unusual for this type of conference, so I appreciate everybody coming to take a look. So, a little bit about me. I'm a research professor at an engineering school in Paris called EPITA, where we train uh engineers in computer science. Um, I split my time between uh teaching and research. A little bit of motivation about um what we're trying to do here: um, in in testing, I found uh from my experience that I need to generate uh data for my tests, and generate those usually randomly. Um, and there's a field called property-based testing which I use quite a lot, but uh what I'm going to talk about is not specifically uh tied to property-based testing, although it's it's quite uh common. And so basically what you need to do is you need to generate an object, or select an object, out of a type, a type being a set of potential values. So given some set T of target objects, uh capital T, we want to select or construct some little t in that set. We'd like that selection to be surjective, surjective in the sense that anything in the set could be selected. So there's nothing off limits; we want to be able to generate anything in the set. Um, and uh somewhat controversially, we'd like the uh the selection to be uniform. So we'd like to give equal probability to any object in the set. There are some exceptions, right? There are trivial objects or degenerate objects, such as, if you're generating a list, you want to make sure that you've generated the empty list in your sampling as well. But apart from uh trivial cases, we would like some amount of uniformity in the selection. But this uh surjectivity and uniformity in practice turn out to be difficult; it is hard to maintain both of those constraints simultaneously.
So what we end up doing is constructing the object in some other set, which I call S, um and then mapping from S to T. So it might be easier to construct an object in set S, and then have a function which takes that and maps it into the target space. But this um detour very often causes um non-uniformity in the target space. Um, we find that very often this technique generates trivial objects in the target space. Um, so what we would like to do is bias our selection in the source space so that after the mapping we have better uniformity in the target space. So that's the idea, and very specifically, my problem is I want to generate finite automata, um and it's quite difficult to do that um in a uniform way. But what we can do is we can construct regular expressions, um which end up being just trees that are constructed, and then from a regular expression convert that to a finite automaton. Um, but what generally happens is that uh we get finite automata corresponding to empty languages. So uh you have very large regular expressions which then correspond to a regular expression which matches nothing. So we call this the semantic space and the syntactic space. The syntactic space is what we're constructing as a tree, and then that has some meaning, corresponds to some set of objects, and that's the semantic space: this set of objects which gives meaning to the trees that you're constructing. So as an overview of what I'm going to talk about: um, we're going to talk about, well, what are regular expressions, and how are they related to trees, or how can they be represented as trees? What is their connection to finite automata? I'm using a special type of finite automaton called symbolic deterministic finite automata.
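The detour through a source space can be shown with a toy example. This is a hedged Python sketch (not the speaker's code): pairs are drawn uniformly from a source set S, but their sums in the target set T are far from uniform, even though every element of T remains reachable, so the mapping is surjective.

```python
import random
from collections import Counter

random.seed(0)

# Source space S: pairs (a, b) drawn uniformly from {0..3} x {0..3}.
# Target space T: sums a + b, i.e. {0..6}. The map S -> T is surjective,
# but uniform sampling in S induces a non-uniform distribution on T.
counts = Counter(random.randrange(4) + random.randrange(4)
                 for _ in range(100_000))

assert set(counts) == set(range(7))   # surjective: every sum shows up
assert counts[3] > 3 * counts[0]      # but the middle dominates the edges
```

The middle value 3 can be formed four ways while 0 can be formed only one way, which is exactly the kind of bias the talk describes in the semantic space.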
Um, how do we generate these uh at random; then some algorithms for biasing the generation, um, which is called balanced sampling; some experimental results; and then some final thoughts. So what are regular expressions and how are they related to abstract syntax trees? Uh, so when we look at a sequence of objects, especially in uh dynamic languages, um, the sequence very rarely is completely random. The person generating the sequence has some idea of what the pattern is, and that pattern is often only in the head of the programmer, and he writes code to parse that in some sort of way, and very defensive code to take actions if the data is not in the format that he had in mind. Um, what is the pattern in this case? Well, we have integers followed by one or more of what I call real numbers, so um rational numbers or floating-point numbers. Um, so we can represent that as this regular expression of types: a long, followed by either ratio or double, and then ratio-or-double can be repeated, can occur one or more times with the plus, and then that concatenation of long followed by some reals can be repeated zero or more times. That can be represented by this abstract syntax tree: uh, a star over a cat. So it's a star, over a concatenation, over, on one side, a long, and on the other side a union, or actually a plus of a union, of ratio and double. So regular expressions mathematically can be thought of as these expression trees. Um, in Clojure we can denote this with an s-expression. So keywords represent the operators of the regular expressions: star, concatenation, plus, or, and, not, etc. And at the leaf level of the trees we have some types which interface us eventually to the Java type system: long, ratio, double, and so on. Um, the syntactic space is the set of such trees. So we can construct those just by putting the pieces together.
The semantic space is that every such tree corresponds to a set of sequences which match that regular expression. That set is usually infinite, but could be uh empty. And so given two regular expressions, the intersection of the regular expressions corresponds to the intersection of the corresponding sets. Those intersections then could be empty, or one set could be a subset of the other, or they might have intersections that are non-empty. And those are the types of questions you want to ask mathematically: given the regular expressions, what do we know about the language of the regular expressions, the set of sequences that matches the regular expressions? In Clojure, uh, idiomatically we don't so often talk about explicit types, but we have these predicate functions built into the language: int?, ratio?, float?. And so we can also, in the Clojure implementation of this, specify these, our regular expressions, in terms of the predicates, using this special syntax of satisfies. Um, so this is what the programmer types, but programmer types in the sense of hands on the keyboard. Um, but behind the scenes, because uh Clojure has a great deal of reflection and we have this source function, the implementation of int? is actually written in terms of types. So those types can be directly inlined into the regular expression, so that at runtime these type predicates don't have to be evaluated, in many, many cases. In some cases the type predicates remain, but in many cases they can be completely inlined. So the user doesn't have to think about the set of types; the user can just uh think in terms of the predicates. So, um, RTE, um, which stands for regular type expression, is a library with a lot of operations for evaluating regular expressions: doing emptiness checks, doing uh subset checks, uh checking whether a sequence matches a regular expression.
So such a library needs to be tested, so we need to be able to randomly generate regular expressions as input for the unit tests. Um, uh, generating such a random tree is quite straightforward given the abstract data type of the data. An abstract data type is, thinking in terms of object orientation in your brain, that a something is one of several somethings. So uh star and not and cat and and and or are operations, and then we have terminal objects: sigma, representing the universe, the universal set; the empty set; and then epsilon, a very special set corresponding to only the empty sequence, because the empty sequence turns out to come up very often in this discussion. So to generate such an object we just need a case analysis. We generate, or we select, one of these keys at random, and then depending on whether we chose type or sigma or and or not, we generate a regular expression recursively. For example, if we chose and, then we build a list of and, and then a recursive call to generate uh another regular expression of depth minus one. So hierarchically, or recursively, we can generate uh any regular expression this way.
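The recursive case-analysis generator can be sketched in Python; the names below are hypothetical stand-ins, since the talk's Clojure implementation is not shown in the transcript.

```python
import random

# Leaf-level types and operators, loosely mirroring the talk's s-expressions.
LEAVES = [":sigma", ":empty-set", ":epsilon", "Long", "Ratio", "Double"]
UNARY = [":star", ":not"]
BINARY = [":cat", ":and", ":or"]

def gen_rte(depth, rng=random):
    """Generate a random regular-type-expression tree by case analysis:
    pick a key at random, then recurse with depth - 1 for each operand."""
    if depth == 0:
        return rng.choice(LEAVES)
    op = rng.choice(UNARY + BINARY)
    if op in UNARY:
        return (op, gen_rte(depth - 1, rng))
    return (op, gen_rte(depth - 1, rng), gen_rte(depth - 1, rng))
```

Trees come out as nested tuples, e.g. `(":cat", "Long", (":star", "Double"))`, playing the role of the Clojure s-expressions.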
So how do regular expressions correspond to finite automata, which is, as I said originally, what we're actually trying to test? Um, so given such a sequence, um, of strings, floats, what else do we have, integers there, um, we want to describe that with a regular expression, as I have here: long, double, etc., etc. Programmatically, how you manage this: you have to generate finite automata, and I'm not going to talk about how that generation is done in this talk, but it's a well-known uh process, and that's done at compile time in a language like Clojure, basically via macro expansion. That can be a lengthy process, but you alleviate that overhead by using the Clojure macro facility. And then at runtime we want to ask, given a sequence, does it match the expression? So at runtime we have the finite automaton that was generated at macro-expansion time, and then we iterate across the sequence. So, is 13 a long? Yes, so we move from state zero to state one. Is 2 a double? Yes, so we move to state two. Then we find a double, an integer, or actually a long, a string. String, string, integer, double, double, double, integer, and finally a double. And the sequence is at an end, and we've landed in a state with a double circle, meaning the sequence is accepted. So we have verified that this sequence is a member of the language of that regular expression. If anything else happens, then the sequence is rejected. If there ever was a type which didn't match one of the transition labels, then we reject the sequence. And if we finish the sequence not in a so-called accepting state, then the sequence is rejected; the sequence did not match the regular expression. What's interesting about that process is that it um has linear complexity: regardless of the complexity of the regular expression, and I mean really a huge regular expression, to figure out whether a sequence matches it or not takes linear time.
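The runtime walk the speaker describes can be sketched as a symbolic DFA in Python, with int/float standing in for Long/Double. This is a simplified pattern, (long double+)*, not the talk's exact automaton; the transitions are labeled by type predicates rather than characters.

```python
# States and predicate-labeled transitions for the pattern (long double+)*.
TRANSITIONS = {
    0: [(lambda x: type(x) is int, 1)],
    1: [(lambda x: type(x) is float, 2)],
    2: [(lambda x: type(x) is float, 2),
        (lambda x: type(x) is int, 1)],
}
ACCEPTING = {0, 2}

def matches(seq):
    """One left-to-right pass over the sequence: linear time, no backtracking,
    because the predicates leaving each state are mutually disjoint."""
    state = 0
    for item in seq:
        for pred, nxt in TRANSITIONS[state]:
            if pred(item):
                state = nxt
                break
        else:
            return False  # no transition label matched: reject immediately
    return state in ACCEPTING
```

For example, `matches([13, 2.0, 3.5])` walks states 0, 1, 2, 2 and accepts, while `matches([13])` ends in the non-accepting state 1 and rejects.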
So we only have to traverse the sequence once, and that's quite surprising: there is no backtracking, there is no exponential complexity. It's really a linear uh um complexity question, as long as the finite automaton is what we call deterministic. So in this case, uh, if we encounter the element 3, well, 3 is both a long and a number, so we don't know which branch to take. If your implementation is in terms of what we call non-deterministic finite automata, then the membership query turns out to be of exponential complexity. We avoid that by determinizing the state machine, so that um no um type transitions intersect. So we build the type long, and the type number-and-not-long. So then no object is an element of both of those types, and as long as that um condition is met, then checking at runtime is linear, and it's consequently very fast. [clears throat] Um, how is that applicable to Clojure programmers? Well, a simple example of that is this destructuring-case that I've implemented on top of the um the platform here. So given some input sequence: if it matches this pattern, then do this, rename file; if it matches this pattern, delete file; if it matches this pattern, copy file. But we can also incorporate type information into this. So, for the first case, if a is a boolean and b is either a string or a boolean, then rename file; otherwise, if the second pattern matches, then uh delete file, and so on. So what if we have code like this, where in the very last case uh we don't rename file, we don't delete file, we don't copy file, but we launch the missiles? We would very much like to know: is that code reachable? Um, and we would like to be able to prove that no matter what the sequence is, that code will not be reached.
So I told people we can attach this code to to Donald Trump's uh Twitter feed, and we very much know that his Twitter feed is not uh dangerous in the end. So what happens is that we can convert that set of pattern matches, or set of regular expressions, into a finite automaton, and we see that the launch-the-missiles uh branch is never met, and we can prove that, uh, not heuristically but explicitly, and in the macro expansion we actually emit a warning saying we have unreachable code. So again, we have a um finite automaton library that does lots of things, these various macro expansions, but also building finite automata and checking subsetness and intersection and everything you want to do with finite automata. To test this, we need to generate finite automata randomly as input to our test suites, and doing so turns out to be quite hard, actually; I don't know how to do that in a good way. Um, but the way that we decided to do it, and the way that's very common, is you generate the regular expressions randomly and then build a finite automaton from that. This is sort of a very optimistic example: generating such an abstract syntax tree on the left, and then the very nice finite automaton that is generated from that, um, on the right. That's atypical. The normal thing that happens is you have a huge uh abstract syntax tree which resolves to a trivial finite automaton. This happens most of the time. We would like to avoid that, because it's not a very good test, obviously, for the um finite automaton library. The problem is that we have so-called annihilators in our algebra. So, for example, the and is the intersection of two sets. It's very often that two sets have no intersection, and then intersection returns uh the empty set, and whenever we deal with the empty set above that, it stays empty.
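The annihilation effect can be illustrated with a small bottom-up simplifier. This is a hedged Python sketch that only propagates a syntactic :empty-set marker; real emptiness checks in a library like RTE are semantic and considerably more involved.

```python
EMPTY = ":empty-set"

def simplify(tree):
    """Propagate the annihilator upward: an :and or :cat node with an
    empty operand is itself empty, so emptiness climbs toward the root."""
    if not isinstance(tree, tuple):
        return tree
    op, *args = tree
    args = [simplify(a) for a in args]
    if op in (":and", ":cat") and EMPTY in args:
        return EMPTY
    return (op, *args)

# A large-looking tree annihilated by one deeply nested empty intersection:
tree = (":cat", (":star", "Long"),
        (":and", (":or", "Double", "Ratio"), EMPTY))
assert simplify(tree) == EMPTY
```

An :or node, by contrast, is not annihilated by an empty operand, which is why the placement of :and and :cat in the tree matters so much.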
So large portions of the tree which were generated are basically annihilated by this intersection. So all of the work that went into generating those subtrees turns out to be useless, turns out to have been useless. Um, the intersection, yes. So the work generating that is both in vain, and we failed to test the finite automaton library that we're intending to uh to test. Um, in general, the fix for this is we want to generate trees that are what I call landscape: they're wide and shallow, as opposed to portrait, which are narrow and deep. This is the rule of thumb; the takeaway from the uh discussion is: if you generate trees that look like this, you're more likely to have interesting data than if you have trees that look like this. Yes, we would like to generate um uh landscape, rather than, we must generate portrait... no, we want to generate landscape. So the slide is completely backwards. Um, we want to generate trees that are wide and shallow. I read that slide a thousand times; I never noticed the error there. Um, so what are some algorithms for doing this? Um, one is called the tree-split method. We want to generate a tree, for example, with 10 leaves. So we split 10 into two parts, three and seven, and then we generate two subtrees, one with three leaves and one with seven leaves. To generate seven leaves, we split seven into two pieces, two and five, uh, one and two; we keep splitting until we have our tree generated. So then we've generated a random tree with a given number of leaf nodes. Um, the question is, well, how do we split? Uh, what kind of pivot function can we use as a general splitting function? And there are several possibilities. We could have a Gaussian split, one that tries to split close to the middle but could split anywhere, because we do want uh surjectivity. We could have an inverse Gaussian, one that always splits toward the edges, always tries to give an imbalanced split.
Linear, which splits anywhere from one to n, with no uh bias. And then there's one that I use for testing purposes, which I call a comb, which always uh splits at the edge, just as a control case for the testing. Um, the code looks something like this: to generate a certain number of leaves, we choose a random number between one and 100, and with 30% likelihood or 20% likelihood we generate uh these subtrees. For example, 20% of the time we generate a concatenation of left and right, and left is just a recursive call on the left, and the right, that preserves the um requested number of total leaves. But we've already called the pivot function to give us the uh partitioning. And then there's a function that calls split-tree, giving it the lopsided pivot function, the linear pivot function, the Gaussian pivot function, and the inverse Gaussian pivot function. So the amount of code to do that is actually fairly small. Um, the other algorithm, uh, which is quite interesting, is called the uh Flajolet algorithm; it's named after a Belgian computer scientist, I forget his first name. And the way that works is: you want to generate a tree with a certain number of nodes, so you generate a list, shuffle the list, and then just insert that shuffled list, in order, into a binary search tree. So we start with five, and we insert four, and we insert two; we insert eight, which is greater than five; we insert seven, and nine is greater than eight. Uh, we keep inserting and inserting until the list is finished, and then we have a tree. So we could have generated any tree, but on average we've generated a balanced tree. So this is called the Flajolet algorithm. We do this uh first as a pre-process to generate a skeleton, and then using that skeleton we fit the regular expression around that skeleton. So the code looks something like this: we shuffle the range of one up to the number of leaves.
We insert that into a binary tree by calling reduce, starting with an empty tree. Then once we have the skeleton, we call this to-RTE function on the skeleton, which uses the shape, the topology, of that um to generate a tree, again with a certain set of probabilities. So, a very quick uh demo of um how this uh works. I say quick, I do mean very quick. So it's generated several trees. So this one actually looks quite interesting. The red and the blue represent where these annihilations have occurred. So from that uh tree, this finite automaton was generated. So this is actually, I consider, very nice, that we have some structure in there. In this case we have um quite a large tree which reduces to quite a small finite automaton. So this is not very good: we have a lot of loss; we don't have a lot of retention in this particular case. Uh, here's another case. This is using the Flajolet algorithm, which reduces to, again, a quite interesting uh finite automaton. Uh, this is the comb algorithm, which is intentionally lopsided for testing purposes, and that very large tree uh generates a trivial finite automaton. Um, in this case, again, this is the linear algorithm, which generated, uh, well, somewhat of a letdown, right, because we have a lot of information here and then not much information there, and so on. Okay. Uh, so the thing that usually occurs in all cases is that uh you're depressed by looking at the size of the finite automata that are generated. So, um, we'd like to do some measurements: which of these uh algorithms gives the best results?
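The two generation strategies just described, tree-split with a pivot function and the Flajolet-style BST skeleton, can be sketched in Python. All names here are hypothetical; the talk's Clojure code is not reproduced in the transcript, and only the tree shapes are generated (fitting a regular expression around the skeleton is omitted).

```python
import random

def linear_pivot(n, rng):
    """Split anywhere from 1 to n-1 with equal probability."""
    return rng.randrange(1, n)

def gaussian_pivot(n, rng):
    """Prefer splits near the middle; clamp so any split stays possible."""
    return min(max(round(rng.gauss(n / 2, n / 6)), 1), n - 1)

def comb_pivot(n, rng):
    """Control case: always split off a single leaf (maximally lopsided)."""
    return 1

def split_tree(n_leaves, pivot, rng=random):
    """Tree-split method: recursively divide the leaf budget with `pivot`."""
    if n_leaves == 1:
        return "Long"  # placeholder leaf type
    k = pivot(n_leaves, rng)
    return (":cat", split_tree(k, pivot, rng),
            split_tree(n_leaves - k, pivot, rng))

def bst_insert(bst, key):
    """Insert into an immutable BST of (key, left, right) triples."""
    if bst is None:
        return (key, None, None)
    k, left, right = bst
    if key < k:
        return (k, bst_insert(left, key), right)
    return (k, left, bst_insert(right, key))

def flajolet_skeleton(n, rng=random):
    """Flajolet-style skeleton: shuffle 1..n, then reduce-insert into a BST.
    Any shape can occur, but the expected shape is balanced."""
    keys = list(range(1, n + 1))
    rng.shuffle(keys)
    bst = None
    for k in keys:
        bst = bst_insert(bst, k)
    return bst
```

Both functions return only a skeleton shape; a generator would then walk that topology and choose operators and leaf types with the probabilities described in the talk.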
So the thing that we do is, I ran a huge simulation, generating a lot of these trees with the different algorithms, counting the number of nodes, building a finite automaton, counting the number of states. Um, we also measure what I call the aspect ratio, which I'll show later: the shape of the trees, so the landscape-ness or portrait-ness is measured. And um, from there, here are several pieces of results. So the different colors represent the different algorithms. Um, in 37% to 45% of the cases, finite automata of size two were generated. There's always an invisible state that you don't see; that's why we don't see a size one. And potentially the most interesting part of this is the number of finite automata of size 10 or more. So the Flajolet algorithm has the best results at generating trees that are large; however, it has the worst results in that it generates the trivial tree most often. So it's a bit of a contradiction there. Um, to measure the aspect ratio, we divide the longest path by the log of the number of leaves, because the depth and the number of leaves are in an exponential-logarithmic relation. So one means square, less than one means landscape, greater than one means portrait. So here's one dot for each simulation. The size of the tree is measured on the y-axis and the aspect ratio on the x-axis. Um, and so you can't really look at that and um get a good picture, although you do see a lot of purple at the top there. So I do what I call a rat's nest: I just connect the dots of the same color randomly, so you get an idea of where the heart of the simulations occurs, and then how many outliers exist. So in this case the linear algorithm looks like that. The Gaussian split algorithm looks like this, which is quite good: everything is fairly low aspect ratio, and you have an interesting set of outliers.
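The aspect-ratio measure just described is simple to state. A hedged sketch, assuming log base 2 (the talk says only "the log of the number of leaves"):

```python
import math

def aspect_ratio(longest_path, n_leaves):
    """Longest root-to-leaf path divided by log2 of the leaf count:
    ~1 is 'square', < 1 is landscape (wide/shallow), > 1 is portrait."""
    return longest_path / math.log2(n_leaves)

# A perfectly balanced tree with 16 leaves has a path of length 4:
assert aspect_ratio(4, 16) == 1.0
# A comb with 16 leaves has a path of length 15: strongly portrait.
assert aspect_ratio(15, 16) == 3.75
```

Normalizing by the logarithm is what lets trees of very different sizes be compared on one x-axis.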
The um inverse Gaussian, the um Flajolet algorithm, also has quite low aspect ratio and quite a lot of outliers. So the outliers are what you want; you want these special cases, because normally, as I said, the finite automata are uninteresting. You have to run a lot of simulations to get interesting cases. Um, the second graph there is the percentage of finite automata which are beyond a certain size. So um, when we look at 100, that is, how many finite automata of size 100 or larger were generated by each of the algorithms, we see that, for example, the purple algorithm, which is the linear split, generated greater than 100 states 1% of the time. Okay. And so in that we see that the Flajolet algorithm, the orange one, is pretty much the best, although the amount by which it is better is not really substantial; it's marginally better. And we also see this sort of asymptotic behavior, which I'm not sure is real or just the fact that we didn't generate a lot of data: they do seem to converge for very, very large cases, which occur rarely. Um, what we have here is a measure of the loss versus the retention. So um, the uh linear algorithm, whereas the Flajolet algorithm has quite a lot of retention. Retention meaning that we had a small AST which generated a large DFA, as opposed to a large AST which generated a small DFA. Okay. So, in summary, um, we didn't get the actual results that we expected. The papers we were using predicted much better results from the balancing algorithms, and exactly why that's the case is still unclear. Although we do somewhat confirm the tendency, uh, it doesn't uh deserve a Nobel Prize. Um, perhaps the effect of these annihilators on our algebra is different from what was in the uh papers that we were looking at.
So in general, the um balancing approach does give you better results, and the amount of work you have to do to get balanced trees is in fact not very much. So you have a small amount of work for a payoff, but not really a huge payoff; there's not really any uh motivation for avoiding this approach. Um, we have versions of the project in Common Lisp, Clojure, Python, and um in Scala, and those implementations are available uh from those GitHub repositories. You can reach me at uh these addresses, and I really welcome your feedback, and I especially welcome critical feedback, because science advances when people find problems in your work, not when they say, oh, this is great, and then it goes unused. So I thank you very much for your time. [applause] I'm happy to take questions if anyone has them, either formally or in the hallways. >> Oh yeah, we don't have time for questions. Oh, go ahead, we've got time for one question. Your question, sir. >> Thank you so much for the talk and sharing these results. I was curious if any work had been done. You mentioned a lot about the annihilators. Any work in terms of how you bias the placement of the annihilators in the tree? You talked about sort of forming them around the skeleton, but I wonder how you think about placing annihilators. >> I haven't, um, I haven't seen that formal result, but as you look at the graphs, uh, you see that when the annihilators happen high on the graph, the consequence of the annihilation is severe, and when the annihilation happens low on the graph, you don't lose a lot of work. So if you could bias it such that at the top level of your recursion the annihilators occur less often, then you could potentially get better results. Yes. >> All right, one more round of applause for Jim Newton. [applause]

Video description

Have you ever tried to randomly generate abstract syntax trees (ASTs) for property-based testing? If so, you've probably run into this phenomenon: your generator seems to work fine, but most of the results are boring or meaningless. A 2021 paper by Florent Koechlin and Pablo Rotondo offers a new way to fix this. It's called BST-like sampling, and it biases the kinds of trees you generate. Why does that matter? Because if you pick trees uniformly, giving each one the same chance, you often end up with misleading biases: most of your regular expressions might match nothing, arithmetic expressions often reduce to zero, Boolean expressions usually simplify to just True or False. We'll show how we hit this exact problem when generating regular expressions. At first, our test cases looked okay, but when we dug deeper, most of them boiled down to trivial patterns that weren't useful. After we started using BST-like sampling, the results got much better: trivial cases showed up less often, leading to more interesting and diverse test cases. If you care about better property-based testing, especially for symbolic or structured data, come check it out. This technique might save you a lot of time and frustration.

Biography: Assistant Research Professor at EPITA (School of Engineering and Computer Science). 37 years as a Lisp programmer, 5 years in Clojure.

Recorded Nov 14, 2025 at Clojure/Conj 2025 in Charlotte, NC.

© 2026 GrayBeam Technology Privacy v0.1.0 · ac93850 · 2026-04-03 22:43 UTC