GPU-accelerated program synthesis: from LTL learning to mixed-Boolean arithmetic

A non-technical introduction to a line of research on GPU-accelerated program synthesis, and in particular the recent paper "GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching" by Gabriel Bathie, Baptiste Mouillon, and myself.

Program synthesis is the problem of automatically generating a program from a specification. In its simplest form, the specification is a set of input-output examples: given pairs of inputs and expected outputs, find a program that reproduces that behaviour on all of them. Researchers have been working on this problem since the 1960s, and it remains a central challenge in computer science.

The difficulty is fundamental: finding a correct program requires searching through a space of candidates that grows combinatorially with program size. Programs are built from components (variables, operators, constants), and the number of possible programs grows very rapidly as those components combine. Even a short expression like (x AND y) + z involves choices of operators, nesting structure, and variable placement. Exhaustive enumeration quickly becomes impractical as target complexity increases.

Enter GPUs: parallelism at scale

GPUs (graphics processing units) were originally designed to render pixels. Because a screen has millions of pixels that all need updating at once, GPUs were built around a different principle than CPUs: instead of a small number of very fast processors, they use thousands of simpler ones working in parallel.

This architecture proved useful far beyond graphics. Deep learning adopted GPUs in the 2010s, and the results were impressive: training times fell from weeks to hours, and models scaled to previously impractical sizes. The reason is structural. Neural network training decomposes naturally into many independent computations that can proceed simultaneously, which is exactly what GPU hardware is designed for.

A natural question follows: can program synthesis also benefit from this kind of parallelism?

First attempt: learning logic formulas on GPUs

Our first answer came in a 2024 paper: LTL Learning on GPUs.

The target was Linear Temporal Logic (LTL), a formal language for specifying how systems should behave over time. LTL is widely used in industrial software verification, covering safety properties like “the door must never open while the engine is running.” Learning an LTL formula from examples (positive traces that should be accepted, negative ones that should be rejected) is a non-trivial synthesis problem.

The key algorithmic idea is bottom-up enumeration: start from atomic building blocks, then systematically combine them into larger and larger formulas, checking each candidate against the specification. It’s a brute-force search, but a principled one: you’re guaranteed to find the smallest formula that fits the data.

The challenge is that bottom-up enumeration produces a lot of redundancy. Many syntactically different formulas behave identically: A AND B and B AND A do the same thing, as do countless other variations. Exploring all of them is wasteful.

The solution the paper introduced was to cache formulas by their behavior, not their syntax. Two formulas that produce identical outputs on all test inputs are grouped together, and you only need to keep one representative from each group. This is called observational equivalence, and it dramatically prunes the search space.

Running this pruned search on a GPU, with thousands of threads each evaluating candidates in parallel, produced substantial improvements: the GPU-based learner handled at least 2048 times more traces than the previous state of the art, and ran at least 46 times faster on average.

The wall: when caching breaks down

This caching strategy works well when program behaviors are qualitative — when the output of a program on any given input is a simple yes or no (or true or false). LTL formulas do exactly this: they accept or reject a trace. Two formulas that agree on all test traces land in the same cache bucket.

Now consider a different kind of expression: Mixed-Boolean Arithmetic (MBA). MBA expressions combine bitwise operations (AND, OR, XOR, NOT) with ordinary arithmetic (+, −, ×) over fixed-width integers. A simple example: (x XOR y) + 2 * (x AND y). This turns out to equal x + y, a fact exploited by obfuscation tools, which transform readable code into semantically equivalent but unrecognizable tangles of MBA. Reversing this transformation, taking the tangled expression and recovering the simple one, is an MBA synthesis problem with real security implications.

MBA expressions do not produce Boolean outputs, they produce numbers: for 32-bit integers, each expression returns one 32-bit value per input. Two expressions that differ on even a single input land in different cache buckets. With 16 test inputs of 32-bit values, the behavioral space has size 2^{32 times 16} = 2^{512}, which is far beyond any feasible cache. Observational equivalence classes are almost always singletons, so the deduplication that makes caching effective in qualitative domains such that LTL simply does not apply here. The search space remains exponentially large, with no practical way to prune it through caching. A direct port of the LTL approach degrades rapidly on MBA instances of even modest size.

The new idea: go cache-free entirely

In this paper, we introduce SIMBA (Synthesis of Mixed-Boolean Arithmetic), taking a different approach: rather than attempting to make caching work in this quantitative setting, we abandon the cache entirely.

At first, removing the cache seems to give up the main source of efficiency. In the LTL setting, the cache serves two purposes: it prunes the search by grouping observationally equivalent candidates, and it stores intermediate sub-expressions for constructing larger ones. In the MBA setting, the pruning benefit is negligible (since equivalence classes are essentially singletons). But the storage role can also be eliminated, given the right algorithm design, and this is where the GPU’s parallelism becomes decisive.

The idea: instead of building up a library of sub-expressions, enumerate every candidate expression independently from scratch. Each GPU thread receives a single integer (its thread ID), and from that integer alone decodes a complete MBA expression, evaluates it against the target specification, and immediately discards it. No storage, no communication between threads, no cache.

The key ingredient is the decoding step. We define a bijection between integers and valid MBA expressions of a given size. Thread 0 gets expression 0 (decoded to, say, x0), thread 1 gets expression 1 (decoded to x1), thread 100 gets expression 100 (decoded to something more complex), and so on. Every expression of a given size gets exactly one thread. Together, the threads exhaustively cover all possibilities in parallel.

This bijection is designed with one more constraint: nearby integers decode to structurally similar expressions. GPU hardware executes groups of 32 threads in lockstep, in what is called a “warp”, and performance degrades sharply if threads in the same warp take different code paths (this is called branch divergence). By ensuring that consecutive integers produce structurally similar expressions, threads in the same warp follow nearly the same execution path, keeping the GPU busy and efficient.

What this looks like in practice

Concretely: SIMBA represents expressions in a compact notation (Reverse Polish Notation), precomputes a small table counting how many valid expressions exist at each size, and uses that table to implement the integer-to-expression decoding efficiently in thread-local registers, with no slow memory accesses needed.

At each size level, SIMBA launches enough GPU threads to cover every possible expression of that size. Each thread decodes its expression, evaluates it on all test inputs, compares the results to the target, and reports success if they match. If no thread finds a match, move to the next size and repeat.

The result is a synthesizer that is:

Parallel by design: threads share no state and need no coordination.
Memory-light: no cache means no memory pressure, regardless of how many variables or test inputs the problem has.
Sound and complete: every expression of every size is eventually enumerated, so if a solution exists, it will be found.

How much better is it?

The experiments compare SIMBA against two baselines: VFB (the direct GPU port of the LTL caching approach applied to MBA) and cvc5 (a state-of-the-art CPU-based SMT solver).

We built a new benchmark suite of 2340 MBA synthesis problems, spanning a wide range of sizes and numbers of variables. (The existing standard benchmark turned out to be too easy: SIMBA solved 90% of it in under a second.)

The results:

SIMBA solves nearly all instances up to size 11, and is the only algorithm that can solve instances of size 12–16. VFB and cvc5 both stall around size 6–7.
Against VFB: on the 615 instances both solved, SIMBA is faster on 90% of them, with a median speedup of 1.6×. The remaining 634 instances SIMBA solved are entirely beyond VFB’s reach.
Against cvc5: on shared instances, SIMBA finishes in under one second on 96% of them. The 668 instances only SIMBA solved are beyond cvc5’s time limit.

The gap is substantial. Cache-based methods hit a size limit beyond which they cannot solve instances within the time budget, while SIMBA continues to succeed at sizes that were previously out of reach.

The broader story

Stepping back, this line of work tells a coherent story about GPUs and program synthesis. GPUs do not reduce the fundamental difficulty of program synthesis. What they offer is a specific kind of resource: thousands of cores that can work simultaneously, provided the algorithm is designed to use them without introducing bottlenecks. Deep learning benefits from GPUs because its core computations decompose naturally into independent parallel operations. Synthesis benefits for the same reason, but only when the algorithm avoids data structures or communication patterns that force threads to wait for each other.

The LTL work showed that semantic caching is one good design. The MBA work shows that when caching itself is the bottleneck, the right move is not to improve the cache but to eliminate it. The principle is the same: design around the GPU’s strengths.

MBA is one domain where outputs are quantitative and caching fails, but it is not the only one. Similar challenges arise in synthesis with numerical constants, synthesis over richer data types, and verification of probabilistic systems. The cache-free paradigm that SIMBA demonstrates is a candidate approach for all such settings.

The SIMBA paper is available at https://arxiv.org/abs/2605.08243. The LTL learning work is at https://arxiv.org/abs/2402.12373 and https://arxiv.org/abs/2504.18943.

Understanding how Transformers solve Sudokus

Read post

A gentle introduction to Mechanistic Interpretability

Read post

Value iteration for parity games

Read post

See all posts

See all posts in category