From program synthesis to Rényi entropy

This is the story of how a conjecture we had been working on became Problem 2 of the International Mathematical Olympiad in 2020. Post jointly written with Guillaume Lagarde.

Searching for programs

Program synthesis is the task of automatically finding a program that satisfies a specification, typically given as input-output examples. Modern approaches use machine learning models (neural networks, large language models, or probabilistic grammars) to assign a probability to each program. The model reflects the prior belief that a given program is the right answer. The details of how this distribution is learned do not matter for what follows. What matters is that we have a probability distribution over programs.

In our AAAI 2022 paper, we introduced a framework called distribution-based search to study how such a distribution should guide the search. We asked two questions: what is the best a deterministic algorithm can do, and what is the best a randomized algorithm can do?

Setup

There are in general infinitely many programs, but for a concrete analysis one can restrict to programs up to a given size, a standard move in program synthesis, since any target expressible in a grammar can be found within a finite size bound. We treat both cases: the finite case makes the combinatorics cleaner and is the natural setting for bounded synthesis; the infinite case requires the sums below to converge, which holds whenever the distribution has finite expected rank.

Label programs so that $p_1 \geq p_2 \geq \cdots \geq p_M$ (or $p_1 \geq p_2 \geq \cdots$ in the infinite case), where $p_i$ is the probability of the $i$ -th most likely program. Assume the target program is itself drawn from this distribution: the target is program $i$ with probability $p_i$ .

A deterministic algorithm tests programs in some fixed order; a randomized one samples programs from some distribution. In both cases, the performance measure is the expected number of programs tested before finding the target.

For the deterministic case, it is not hard to show that the optimal strategy is to test programs in decreasing order of probability. The expected number of guesses for this strategy is:
$E[G] = \sum_{i=1}^{M} i \cdot p_i.$

For the randomized case, the algorithm samples programs independently from some distribution $q = (q_1, \ldots, q_M)$ until the target is found. The target $i$ is found after a geometric number of trials with parameter $q_i$ , so the expected number of trials is:
$E[T] = \sum_{i=1}^{M} p_i \cdot \frac{1}{q_i}.$
The question is how to choose $q$ to minimize $E[T]$ .

An example

Consider the distribution over the natural numbers defined by $p_n = 1/2^{n+1}$ for $n = 0, 1, 2, \ldots$ This is the infinite case: there are programs of every size, and the distribution assigns probability $1/2$ to program $0$ , probability $1/4$ to program $1$ , and so on.

Three algorithms are worth comparing.

Deterministic enumeration ( $A_1$ ): test programs $0, 1, 2, \ldots$ in order. The expected number of guesses is:
$E[G] = \sum_{n \geq 0} (n+1) \cdot \frac{1}{2^{n+1}} = 2.$
This is the optimal deterministic algorithm, and its cost is just $2$ .

Naive sampling ( $A_2$ ): sample programs from the distribution $p$ itself, i.e., draw $n$ with probability $1/2^{n+1}$ . The expected number of draws to hit the target $n$ is the mean of the geometric distribution with parameter $p_n = 1/2^{n+1}$ , which is $2^{n+1}$ . Averaging over the target:
$E[T] = \sum_{n \geq 0} \frac{1}{2^{n+1}} \cdot 2^{n+1} = \sum_{n \geq 0} 1 = \infty.$
Naive sampling has infinite expected cost. Sampling from the distribution you are searching over is not a good idea.

SQRT sampling ( $A_3$ ): sample programs with probability proportional to $\sqrt{p_n} = 1/2^{(n+1)/2}$ . The normalisation constant is $\sum_{n \geq 0} 1/2^{(n+1)/2} = 1 + \sqrt{2}$ , so:
$q_n = \frac{1}{1 + \sqrt{2}} \cdot \frac{1}{2^{(n+1)/2}}.$
The expected number of draws to hit target $n$ is $1/q_n = (1+\sqrt{2}) \cdot 2^{(n+1)/2}$ . Averaging over the target:
$E[T] = \sum_{n \geq 0} \frac{1}{2^{n+1}} \cdot (1+\sqrt{2}) \cdot 2^{(n+1)/2} = (1+\sqrt{2}) \sum_{n \geq 0} \frac{1}{2^{(n+1)/2}} = (1+\sqrt{2})^2 \approx 5.83.$

The deterministic algorithm costs $2$ . Naive sampling costs $\infty$ . SQRT sampling costs about $5.83$ : only about three times worse than the deterministic optimum, despite requiring no memory and no ordering. As we discuss at the end, this trade-off becomes very attractive in the parallel setting.

The case of deterministic search

The natural information-theoretic measure of uncertainty in a distribution is the Shannon entropy:

$H = -\sum_{i=1}^{M} p_i \log_2 p_i.$

The quantity $2^H$ is called the perplexity. We initially tried to link $E[G]$ directly to the perplexity. The intuition was natural: a higher-entropy distribution is more spread out and harder to search, so the perplexity ought to be connected to the expected waiting time, whether as an upper bound, a lower bound, or both.

In the geometric example above, $H = 2$ bits so the perplexity is $2^H = 4$ , while $E[G] = 2$ : the perplexity is within a factor of two of the true cost. That felt like a detail.

The Olympiads

We consulted Stijn Cambie, a very strong mathematician connected to us through a colleague at the time. He quickly showed that the perplexity is not an upper bound on $E[G]$ : there exist distributions where $E[G] > 2^H$ . However, during this process he also identified a specific case with $M = 4$ where the opposite inequality $E[G] \leq 2^H$ holds and is essentially tight. He proposed this as Problem 2 of the 2020 International Mathematical Olympiad, where it appears in a different but equivalent combinatorial language.

There is nonetheless a genuine connection between $E[G]$ and the perplexity. As we later discovered, James Massey had proved in 1994 in “Guessing and Entropy” that the perplexity is always a lower bound up to a constant factor:
$E[G] \geq \frac{1}{4} \cdot 2^H.$
So the perplexity is connected to $E[G]$ , but only from below. For an upper bound, one needs a different quantity.

The case of randomized search

For the randomized case, the optimal distribution $q$ can be found directly. We want to minimize $E[T] = \sum_i p_i / q_i$ over all probability distributions $q$ . Apply Cauchy-Schwarz’s inequality to the vectors $(\sqrt{q_i})_i$ and $(\sqrt{p_i / q_i})_i$ :

$\left(\sum_{i=1}^{M} \sqrt{p_i}\right)^2 = \left(\sum_{i=1}^{M} \sqrt{q_i} \cdot \sqrt{\frac{p_i}{q_i}}\right)^2 \leq \left(\sum_{i=1}^{M} q_i\right)!\left(\sum_{i=1}^{M} \frac{p_i}{q_i}\right) = E[T].$

So $E[T] \geq \left(\sum_i \sqrt{p_i}\right)^2$ for any choice of $q$ . Equality holds in Cauchy-Schwarz when $\sqrt{q_i}$ is proportional to $\sqrt{p_i/q_i}$ , which gives $q_i \propto \sqrt{p_i}$ .

The optimal randomized algorithm therefore samples each program $i$ with probability proportional to $\sqrt{p_i}$ ; we called this SQRT sampling. Its expected number of trials is:
$E[T] = \sum_{i=1}^{M} p_i \cdot \frac{\sum_j \sqrt{p_j}}{\sqrt{p_i}} = \left(\sum_{i=1}^{M} \sqrt{p_i}\right)^2.$

Quite surprisingly, the optimal randomized algorithm admits an exact answer, and it involves square roots of probabilities.

The Rényi Entropy of Order 1/2

The quantity $\left(\sum_i \sqrt{p_i}\right)^2$ that governs randomized search is an instance of the Rényi entropy family. The Rényi entropy of order $\alpha$ is:
$H_\alpha(X) = \frac{1}{1-\alpha} \log_2!\left(\sum_{i=1}^{M} p_i^\alpha\right).$
At $\alpha \to 1$ this recovers the Shannon entropy $H$ . For $\alpha = 1/2$ :
$2^{H_{1/2}(X)} = \left(\sum_{i=1}^{M} \sqrt{p_i}\right)^2.$

The optimal randomized algorithm achieves exactly $E[T] = 2^{H_{1/2}(X)}$ . What about the deterministic one? The following bounds hold:

$\frac{2^{H_{1/2}(X)}}{1 + \ln M} \;\leq\; E[G] \;\leq\; 2^{H_{1/2}(X)}.$

Upper bound. Since probabilities are sorted in decreasing order, $p_j \geq p_i$ whenever $j \leq i$ . Therefore $\sum_{j=1}^{i} \sqrt{p_j} \geq i\sqrt{p_i}$ , giving $i \leq \left(\sum_{j=1}^M \sqrt{p_j}\right) / \sqrt{p_i}$ . Multiply both sides by $p_i$ and sum over $i$ :
$E[G] = \sum_{i=1}^{M} i \cdot p_i \;\leq\; \sum_{i=1}^{M} \left(\sum_{j=1}^{M} \sqrt{p_j}\right)\sqrt{p_i} = \left(\sum_{i=1}^{M} \sqrt{p_i}\right)^2 = 2^{H_{1/2}(X)}.$

Lower bound. Write $\sqrt{p_i} = \sqrt{i \cdot p_i} \cdot \tfrac{1}{\sqrt{i}}$ and apply Cauchy-Schwarz:
$\left(\sum_{i=1}^{M} \sqrt{p_i}\right)^2 \leq \left(\sum_{i=1}^{M} i \cdot p_i\right)!\left(\sum_{i=1}^{M} \frac{1}{i}\right) \leq E[G] \cdot (1 + \ln M).$

Rearranging gives $E[G] \geq 2^{H_{1/2}(X)} / (1 + \ln M)$ .

The proofs fit on half a page. The upper bound says the deterministic algorithm is at least as good as the randomized one: $E[G] \leq E[T] = 2^{H_{1/2}(X)}$ . The lower bound says it cannot be better than $2^{H_{1/2}(X)}$ by more than a logarithmic factor.

The Rényi entropy of order $1/2$ is the right measure of difficulty for this problem. It governs both the deterministic and the randomized optimal algorithms, and it emerges from Cauchy-Schwarz applied in two different ways.

Beyond sequential search: parallel algorithms

The sequential setting where one program is tested at a time is the right abstraction for a single CPU core, but modern hardware changes the picture. With hundreds of cores or thousands of GPU threads running simultaneously, the relevant question is no longer how many programs are tested in total, but how many parallel rounds are needed before finding the target.

The two paradigms (deterministic and randomized) behave very differently here. For deterministic algorithms, the order in which programs are tested depends on their ranks, and there is no natural way to distribute the enumeration across independent workers without coordination. Sampling, by contrast, is embarrassingly parallel: each worker independently draws a program from the distribution $q_i \propto \sqrt{p_i}$ and checks it, with no communication needed between workers. Doubling the number of workers halves the number of rounds.

This difference in parallelizability is one reason randomized search becomes more attractive in practice, even when the sequential expected costs are comparable. But taking full advantage of massively parallel hardware, and in particular GPUs, which expose thousands of threads, requires more than just running SQRT sampling in parallel. GPU threads must stay synchronized, avoid branching, and work without shared mutable state; the architecture rewards algorithms designed around these constraints from the ground up.

In our recent paper, we pursued this direction for a concrete and demanding synthesis domain: Mixed-Boolean Arithmetic expressions. We introduced SIMBA, a cache-free enumeration algorithm built for GPU parallelism. Each GPU thread independently decodes and evaluates a complete candidate expression from its thread index. The result is an algorithm that scales to thousands of threads without coordination overhead and solves synthesis problems that were out of reach for prior methods.

Understanding how Transformers solve Sudokus

Read post