Why is prompting hard? Understanding prompts on binary sequence predictors

Anian Ruoss; Jordi Grau-Moya; Li Kevin Wenliang; Marcus Hutter; Tim Genewein

arXiv: 2502.10760 · v2 · submitted 2025-02-15 · 💻 cs.CL · cs.LG · stat.ML

Pith reviewed 2026-05-23 02:35 UTC · model grok-4.3

keywords: prompting · conditioning sequences · sequence prediction · pretraining distribution · binary predictors · in-context learning · frontier models

The pith

Optimal conditioning sequences for binary sequence predictors are often unintuitive and explained better by the pretraining distribution than by the target task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats prompting as the problem of selecting the best conditioning sequence for a near-optimal sequence predictor. In controlled experiments with binary sequence predictors whose pretraining distributions are fully known, optimal prompts frequently appear unrelated to the downstream task and outperform standard choices such as task-specific demonstrations. Even exhaustive search over conditioning sequences fails to make optimal prompt identification reliable. The same empirical setup applied to frontier models yields analogous patterns. The pretraining distribution therefore supplies the key explanatory lens for why certain prompts succeed.

Core claim

We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. On numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings.

What carries the argument

The framing of prompting as search for the best conditioning sequence on a near-optimal sequence predictor, tested via binary sequence predictors with known pretraining distributions.
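
To make this framing concrete, here is a minimal sketch of an exhaustive prompt search on an exact Bayes predictor. It assumes the pretraining distribution BernMix(0.2, 0.7) and task Bern(0.7) named in the Figure 2 caption, equal mixture weights (not stated on this page), and a KL objective of the form KL[q(x_1:T) || p(x_1:T | s)]. The function names are illustrative and not taken from the paper.

```python
import itertools
import math

THETAS = (0.2, 0.7)   # component biases of p = BernMix(0.2, 0.7)
WEIGHTS = (0.5, 0.5)  # ASSUMPTION: equal prior weights on the two components
TAU = 0.7             # task q = Bern(0.7)


def posterior(prompt):
    """Posterior over the mixture components after conditioning on the prompt."""
    ones = sum(prompt)
    zeros = len(prompt) - ones
    unnorm = [w * th**ones * (1 - th)**zeros for w, th in zip(WEIGHTS, THETAS)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl_to_task(prompt, horizon=1):
    """KL[q(x_1:T) || p(x_1:T | prompt)], summed over the number of ONEs j."""
    post = posterior(prompt)
    kl = 0.0
    for j in range(horizon + 1):
        # Probability of one specific continuation with j ONEs under q and p.
        q_seq = TAU**j * (1 - TAU)**(horizon - j)
        p_seq = sum(
            w * th**j * (1 - th)**(horizon - j) for w, th in zip(post, THETAS)
        )
        # math.comb counts the continuations that share this number of ONEs.
        kl += math.comb(horizon, j) * q_seq * math.log(q_seq / p_seq)
    return kl


def best_prompt(max_len, horizon=1):
    """Exhaustively score every binary prompt of length 0..max_len."""
    candidates = [
        seq
        for n in range(max_len + 1)
        for seq in itertools.product((0, 1), repeat=n)
    ]
    return min(candidates, key=lambda s: kl_to_task(s, horizon))
```

Under these assumptions, `best_prompt(5)` returns the all-ONEs prompt rather than a demonstration with roughly 70% ONEs: every ONE shifts posterior weight toward the 0.7 component, while every ZERO shifts it back toward 0.2. Trained neural predictors only approximate this Bayes predictor, so their empirically optimal prompts may differ.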

If this is right

Popular methods that rely on task demonstrations can be suboptimal compared with the best conditioning sequence.
The pretraining distribution supplies a systematic way to predict which conditioning sequences will perform well.
Reliable identification of optimal prompts stays difficult for predictors that are only approximately optimal.
Patterns observed on binary predictors appear again when the same search procedure is run on frontier models.
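
As an illustration of how the pretraining distribution can predict a good conditioning sequence, the sketch below uses the BetaBern(1, β) pretraining distribution and Bern(τ) task named in the Figure 4 caption. It assumes that the first Beta parameter is the pseudo-count for ONEs, that the predictor is the exact Bayes posterior predictive, and that only the next token (T = 1) is scored. The function names are hypothetical.

```python
import math


def predictive_one(ones, zeros, beta, alpha=1.0):
    """Bayes predictive P(next = 1) under a Beta(alpha, beta) prior on the bias."""
    return (alpha + ones) / (alpha + beta + ones + zeros)


def bernoulli_kl(q, p):
    """KL[Bern(q) || Bern(p)] for q and p strictly between 0 and 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))


def best_counts(tau, beta, max_len):
    """Search all (ones, zeros) counts with ones + zeros <= max_len.

    The predictor is exchangeable, so a prompt is fully described by its counts.
    """
    candidates = [(k, n - k) for n in range(max_len + 1) for k in range(n + 1)]
    return min(
        candidates,
        key=lambda c: bernoulli_kl(tau, predictive_one(c[0], c[1], beta)),
    )
```

With τ = 0.7, β = 1 and a maximum length of 8, `best_counts` returns 6 ONEs and 2 ZEROs, because (1 + 6) / (2 + 8) = 0.7 exactly. The prompt therefore contains 75% ONEs even though the task bias is 70%: the prior pseudo-counts have to be compensated for. Longer prediction horizons change the objective and may change the optimum.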

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the pretraining distribution governs optimal conditioning in simple cases, prompt design for large models may benefit from explicit modeling of training-data statistics rather than task-specific heuristics.
The difficulty of exhaustive search suggests that future prompt-finding algorithms should incorporate distributional priors instead of treating the search as unstructured.
The binary-predictor setup isolates the effect of conditioning from other model behaviors, offering a test bed for theories of in-context learning that focus on sequence statistics.

Load-bearing premise

Binary sequence predictors trained to near-optimality on fully known distributions behave like the conditioning mechanisms inside practical neural predictors and frontier models.

What would settle it

A binary sequence experiment in which the empirically optimal conditioning sequence is not the one predicted by the pretraining distribution or is consistently beaten by task demonstrations.

Figures

Figures reproduced from arXiv: 2502.10760 by Anian Ruoss, Jordi Grau-Moya, Li Kevin Wenliang, Marcus Hutter, Tim Genewein.

**Figure 2.** Results for a pretraining DG p = BernMix(0.2, 0.7) and a task DG q = Bern(0.7). Left, the proportion correct for the Bayes predictor ($10^3$ seeds per data point) and two neural predictors (30 seeds per data point). Error bars show 1 SEM. The black dotted line is the theoretical value for T = 1 (see Appendix B.1.2). Additional results are in Appendix B.1.1. Right, empirically optimal prompts at $L_{\max}$ = 5 for the Ba…

**Figure 3.** Results for p = BernMix(0.2, 0.7) and q = Bern(0.6). Left, each circle represents the heads/tails count of the theoretically optimal prompts. The orange dotted line indicates the maximum prompt length $L_{\max}$. The green dashed line marks 60% ONEs in $s^*$. Right, the proportion correct of $\hat{s} = s^*$ …

**Figure 4.** Results for p = BetaBern(1, β) with β ∈ {1, 2}, and q = Bern(τ) with τ ∈ {0.7, 0.9}. Left, the ratio of ONEs in the theoretical $s^*$. Red dotted line shows true bias of q. Right, proportion correct of $\hat{s}$ for β = 1 …

**Figure 5.** Results for a random switching pretraining DG and four switching downstream DGs with different causes …

**Figure 6.** Upper, experiment design for the bandit task. Lower, 4 equivalent theoretical …

**Figure 7.** Final pretraining loss under various DGs …

**Figure 8.** The proportion correct of the empirically optimal prompt for …

**Figure 9.** The loss “landscape” of p = BernMix(0.2, 0.7) and q = Bern(0.7). The leftmost column shows the KL divergence of each prompt sorted in increasing order against the prompt rank (sort indices, or argsort), with lower rank meaning lower KL divergence (12). The other columns show KL divergence of each individual prompt. The prompts here are all expressed by their counts. See Appendix A.5 for a detailed explanat…

**Figure 10.** The distribution of the prompt counts for …

**Figure 11.** The proportion correct of empirically optimal prompt for …

**Figure 12.** The distribution of the prompt counts for …

**Figure 13.** The loss “landscape” of p = BernMix(0.2, 0.7) and q = Bern(0.6). The leftmost column shows the KL divergence of each prompt sorted in increasing order against the prompt rank (sort indices, or argsort), with lower rank meaning lower KL divergence (12). The other columns show KL divergence of each individual prompt. The prompts here are all expressed by their counts. See Appendix A.5 for a detailed explanat…

**Figure 14.** Extends the results in …

**Figure 15.** The distribution of empirically optimal prompt for …

**Figure 16.** The left column shows that the theoretical optimal prompt has a distinctively lower KL divergence compared to other prompts, with the exception that T = 1 still has a very flat landscape. The other columns show a clearer optimal region, especially for τ = 0.7. These are in stark contrast to Figs. 9 and 13, where the optimal points do not stand out among other suboptimal prompts.

**Figure 17.** Proportion of empirical prompts that match the theoretically …

**Figure 18.** Proportion correct of $\hat{s}$. Same as …

**Figure 19.** Distribution of empirically optimal prompts. Here, we set the …

**Figure 20.** The loss “landscape” of …

**Figure 21.** Same as Fig …

**Figure 22.** Same as Fig …

**Figure 23.** Relationship between the return and the uniform random variable for different values of the power index …

**Figure 24.** Estimated return of each prompt by Monte Carlo against the prompt rank. The prompts are sorted using estimated …

**Figure 25.** The estimated rollout return for empirically optimized prompts. Error bars show 1 SEM from 100 seeds.

**Figure 26.** The distribution of Win-Stay/Lose-Shift for all empirical prompts for …

**Figure 27.** Total reward and regret of typical prompts, compared with Thompson sampling agent and the optimal prompt …

Original abstract

Frontier models can be prompted or conditioned to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. On numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings. Taken together, this work takes an initial step towards understanding optimal prompts, from a statistical and empirical perspective that complements research on frontier models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Binary predictors give a clean statistical account of why some prompts beat task demos, but the step to frontier models stays tentative.

The main point is that optimal conditioning sequences on these binary predictors turn out to be unintuitive and directly traceable to the pretraining distribution, which makes standard task demonstrations suboptimal even after exhaustive search. The paper frames prompting as picking the best conditioning sequence for a near-optimal predictor and then runs controlled experiments on fully known binary distributions to show this. They also check the same setup on frontier models and report matching patterns.

That controlled setup is the real contribution. It lets them know the exact distribution, search every possible prompt, and tie the results back to pretraining statistics without the usual black-box issues. This moves the discussion past pure trial-and-error and gives a concrete way to think about why certain conditionings work. The experiments are well-designed for isolating the distribution effect, and the authors are clear that real pretraining distributions are rarely known.

The softer part is the link to actual large models. The binary predictors are low-capacity and lack attention, so their conditioning behavior may not capture how transformers handle long context or token dependencies at scale. The abstract says similar patterns appear on frontier models, but without ablations that test whether the same statistical drivers dominate once capacity and architecture change, the explanation for practical prompting stays partly observational. If those dynamics differ, the binary results explain a simplified case more than they explain frontier-model prompting.

This is for people working on the statistical side of in-context learning. It deserves peer review because the experimental approach is fresh and the binary results are reproducible on their own terms, even if the generalization step needs more work.

Referee Report

1 major / 1 minor

Summary. The paper frames prompting as finding the best conditioning sequence on a near-optimal sequence predictor. Through controlled experiments on binary sequence predictors trained to near-optimality on fully known distributions, it shows that optimal conditioning sequences are often unintuitive and explained by the pretraining distribution, that popular methods such as task demonstrations can be suboptimal, that exhaustive search does not reliably identify optimal prompts for neural predictors, and that similar patterns hold when the same empirical framework is applied to frontier models.

Significance. If the results hold, the work supplies a useful statistical perspective on prompting that complements frontier-model studies by leveraging fully known distributions and exhaustive enumeration to identify suboptimality. The controlled binary-predictor setting and the reported replication of patterns on frontier models are explicit strengths that allow precise, falsifiable observations about pretraining-distribution effects.

major comments (1)

[Abstract and introduction framing; § on frontier-model experiments] The central extension from binary predictors to frontier models rests on the assumption that the conditioning dynamics are governed by the same statistical factors. The manuscript invokes this framing in the abstract and introduction but provides no ablation or quantitative comparison (e.g., of context-length scaling or token-dependency effects) between the attention-free binary models and transformer in-context behavior, leaving the representativeness claim load-bearing yet under-supported.

minor comments (1)

[Methods] Notation for conditioning sequences and pretraining distributions would benefit from an explicit running example early in the methods to improve readability for readers outside the binary-sequence setting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the strengths of the controlled binary-predictor experiments. We respond to the major comment below.

Referee: [Abstract and introduction framing; § on frontier-model experiments] The central extension from binary predictors to frontier models rests on the assumption that the conditioning dynamics are governed by the same statistical factors. The manuscript invokes this framing in the abstract and introduction but provides no ablation or quantitative comparison (e.g., of context-length scaling or token-dependency effects) between the attention-free binary models and transformer in-context behavior, leaving the representativeness claim load-bearing yet under-supported.

Authors: We agree that the current framing in the abstract and introduction could more precisely delineate the scope of the frontier-model analysis. The manuscript applies the same empirical procedure (exhaustive or targeted search over conditioning sequences) and reports qualitatively similar patterns, but does not assert that the underlying statistical factors or scaling behaviors are identical across architectures. The binary setting supplies known distributions and exhaustive enumeration; the frontier-model results are presented as an existence check that the observed phenomena are not artifacts of the simplified model class. To address the concern, we will revise the abstract, introduction, and discussion to (i) state explicitly that we observe analogous patterns without claiming mechanistic equivalence, (ii) note the architectural differences (attention-free vs. transformer attention), and (iii) acknowledge the absence of direct ablations on context-length scaling or token dependencies. These changes will clarify that the representativeness claim is limited to the recurrence of the reported qualitative phenomena.

revision: yes

Circularity Check

0 steps flagged

No circularity: results from direct experiments on known distributions

The paper frames prompting as conditioning a near-optimal sequence predictor and reports empirical results from training binary predictors on fully specified distributions, identifying optimal sequences via search, and comparing to task demonstrations. These are direct measurements, not derivations. No equations, fitted parameters renamed as predictions, or self-citation chains are invoked to force the central claims. The extension to frontier models is presented as an empirical check showing similar patterns. The work is self-contained against external benchmarks via controlled experiments; the modeling choice of binary predictors does not create a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that prompting equals conditioning a near-optimal sequence predictor and that binary predictors are representative; no free parameters or invented entities are stated in the abstract.

axioms (1)

Domain assumption: Prompting can be viewed as finding the best conditioning sequence on a near-optimal sequence predictor.
Explicitly stated as the modeling choice in the abstract.

Reference graph

Works this paper leans on

12 extracted references

[1] Map the binary tokens $x_t \in \{0, 1\}$ to embeddings $e_t \in \mathbb{R}^h$ through $e_t = W_{\mathrm{emb}} x_t$, where $W_{\mathrm{emb}} \in \mathbb{R}^{h \times 2}$ and $h \in \mathbb{N}^+$ is the hidden size.

[2] Sequentially map $h_{1:T}$ through some neural architecture, called the torso, such as LSTM, multi-head attention, etc., to obtain some hidden activations $u_t \in \mathbb{R}^h$.

[3] For each $t$, map $u_t$ through the fully connected MLP to $v_t \in \mathbb{R}^h$ that is usually found after the attention layer in a transformer block (Vaswani, 2017).

[4] For each $t$, map $v_t$ to output logits through a linear map. There is also a residual connection from step 2 to 3 and from 3 to 4. The different neural architectures differ only by the torso. This maintains a flexible enough architecture for different tasks while controlling for the model complexity between different architectures. For the torso, we use the ...

[5] Vanilla recurrent neural networks (Elman, 1990)

[6] Long-short term memory (LSTM) (Hochreiter & Schmidhuber, 1997), reported in main text

[7] sLSTM (Beck et al., 2024)

[8] Softmax-attention transformer (Transformer) (Vaswani, 2017), reported in main text

[9] Linear transformer (Katharopoulos et al., 2020)

[10] Another variant of Linear transformer we refer to as Inner-product transformer (IP transformer) (Li et al., 2020; Shen et al., 2021). We found that step 3 above is crucial for transformer architectures to perform some of the tasks, although this is not essential for LSTMs to perform well, so we leave this stage in for all model architectures. We did use no...

[11] The first bias $y_1$ associated with $s_1$ may not be the first ε appearing in Definition 5.1. In other words, the “phase” of the y is unknown and $y_{1:L}$. Taking λ = 3 for example, y can start with any of the following: [ε, ε, ε, 1−ε, 1−ε, 1−ε, . . . ] [ε, ε, 1−ε, 1−ε, 1−ε, ε, . . . ] [ε, 1−ε, 1−ε, 1−ε, ε, ε, . . . ] [1−ε, 1−ε, 1−ε, ε, ε, ε, . . . ] [1−ε, 1−ε, ε,...

[12] Together with the constant reward streak from the other arm, creating a posterior of Beta(1 + 7τ, 1), choosing the rewarding arm is more likely if τ is large. Essentially, a large reward gap between the two Beta distributions helps the predictor identify τ. Another large reward gap can be induced by other rewa...
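
The architecture steps in entries [1] to [4] can be sketched as a forward pass. This is only one reading of the truncated description, under several assumptions: tokens are one-hot encoded (suggested by the h × 2 shape of W_emb), the torso is a vanilla Elman RNN (one of the listed options), the MLP is a single ReLU layer of width h, the two residual connections wrap the torso and the MLP, and the torso input is the embedded sequence. Weights are random and untrained, and all names are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(hidden, seed=0):
    """Random, untrained parameters; shapes follow the description above."""
    rng = random.Random(seed)
    return {
        "emb": random_matrix(hidden, 2, rng),           # W_emb, shape h x 2
        "rnn_in": random_matrix(hidden, hidden, rng),   # torso input weights
        "rnn_rec": random_matrix(hidden, hidden, rng),  # torso recurrent weights
        "mlp_1": random_matrix(hidden, hidden, rng),    # ASSUMPTION: MLP width h
        "mlp_2": random_matrix(hidden, hidden, rng),
        "out": random_matrix(2, hidden, rng),           # logits over {0, 1}
    }


def forward(params, tokens):
    """Return one pair of logits per input position."""
    state = [0.0] * len(params["emb"])
    logits = []
    for x in tokens:
        one_hot = [1.0 - x, float(x)]        # ASSUMPTION: one-hot token encoding
        e = matvec(params["emb"], one_hot)   # step 1: embedding
        pre = add(matvec(params["rnn_in"], e), matvec(params["rnn_rec"], state))
        state = [math.tanh(z) for z in pre]  # step 2: Elman RNN torso
        u = add(e, state)                    # ASSUMED residual around the torso
        hidden = [max(0.0, z) for z in matvec(params["mlp_1"], u)]
        v = add(u, matvec(params["mlp_2"], hidden))  # step 3: MLP plus residual
        logits.append(matvec(params["out"], v))      # step 4: linear readout
    return logits
```

Swapping the torso (the `state` update) for an LSTM or an attention layer while keeping the other three stages fixed mirrors the stated design goal that architectures differ only by the torso.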
