
arXiv: 2506.09998 · v2 · submitted 2025-06-11 · cs.LG · cs.CL

Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

Pith reviewed 2026-05-19 09:20 UTC · model grok-4.3

keywords: large language models · sampling bias · rejection sampling · Bernoulli distribution · prompt engineering · stochastic generation · Monte Carlo methods

The pith

Verbalized Rejection Sampling reduces bias in LLM coin flips by prompting models to accept or reject their own proposals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can describe probability distributions in words but often produce biased samples when asked to generate them directly. The paper introduces Verbalized Rejection Sampling, a method that asks the model to propose a sample and then reason about whether to keep or discard it. This approach cuts sampling bias on Bernoulli tasks such as fair coin flips even though the underlying random mechanism stays the same. The improvement comes from both the rejection step and the way the prompt guides the model’s reasoning. If the method works as claimed, it offers a lightweight way to make LLM-driven simulations and randomized decisions more reliable without inspecting model weights.

Core claim

Verbalized Rejection Sampling adapts classical rejection sampling to natural-language prompts so that an LLM proposes Bernoulli samples and then evaluates whether to accept or reject them based on its own description of the target distribution. Despite using the identical internal sampling process as direct prompting, the verbalized version produces outputs whose empirical frequencies are closer to the intended probabilities across multiple models. The paper supplies a theoretical argument that, under mild assumptions on the model’s ability to describe distributions, the method yields lower bias than direct sampling, with separate contributions traceable to the rejection algorithm and to the prompt design.

What carries the argument

Verbalized Rejection Sampling (VRS), which prompts the model to generate a candidate sample and then reason step-by-step about accepting or rejecting it according to the target Bernoulli probability.
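
As a point of reference, the classical algorithm that VRS verbalizes can be sketched in a few lines. This is a minimal sketch and not the paper's prompt: Python's pseudo-random generator stands in for the LLM, and the function names are illustrative. It assumes the standard acceptance rule P(x) / (M · Q(x)), with M the smallest constant such that P(x) ≤ M · Q(x) for both outcomes.

```python
import random


def bernoulli_pmf(x: int, p: float) -> float:
    """Probability of outcome x (0 or 1) under Bernoulli(p)."""
    return p if x == 1 else 1.0 - p


def acceptance_probability(x: int, p: float, q: float) -> float:
    """Classical acceptance probability P(x) / (M * Q(x)).

    Target is Bernoulli(p), proposal is Bernoulli(q) with 0 < q < 1.
    """
    # Smallest envelope constant M with P(x) <= M * Q(x) for x in {0, 1}.
    m = max(p / q, (1.0 - p) / (1.0 - q))
    return bernoulli_pmf(x, p) / (m * bernoulli_pmf(x, q))


def bernoulli_rejection_sample(p: float, q: float, rng: random.Random) -> int:
    """Propose from Bernoulli(q) until one proposal is accepted."""
    while True:
        proposal = 1 if rng.random() < q else 0
        if rng.random() < acceptance_probability(proposal, p, q):
            return proposal
```

In VRS both random draws in the loop are delegated to the model through the prompt; here they are exact, so this version has no bias beyond sampling noise.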

If this is right

  • Monte Carlo estimates and agent simulations that rely on LLM randomness become more trustworthy.
  • The same verbal rejection pattern can be applied to other discrete distributions without changing model internals.
  • Performance gains arise separately from the algorithmic structure and from the style of the prompt.
  • No white-box access or fine-tuning is required to obtain the bias reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may extend to continuous distributions if the model can verbalize acceptance regions.
  • Combining VRS with temperature tuning or few-shot examples could produce further bias reductions.
  • The method provides a practical testbed for studying how explicit reasoning affects implicit sampling behavior in LLMs.

Load-bearing premise

The model can be prompted to give a coherent verbal description of the desired probability and to use that description when deciding whether to keep or discard a proposed sample.

What would settle it

Measure the empirical frequency of heads when many LLMs generate 10,000 coin flips with direct prompting versus with VRS; the claim is falsified if VRS does not produce frequencies closer to 0.5.
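
The comparison described here reduces to a small amount of bookkeeping. The sketch below assumes the flips have already been collected and parsed into lists of 0/1 values (1 for heads); the function names and the toy data are hypothetical, and no LLM is called.

```python
from typing import Sequence


def heads_frequency(flips: Sequence[int]) -> float:
    """Fraction of flips equal to 1 (heads)."""
    if not flips:
        raise ValueError("need at least one flip")
    return sum(flips) / len(flips)


def bias(flips: Sequence[int], target: float = 0.5) -> float:
    """Absolute gap between the empirical heads frequency and the target."""
    return abs(heads_frequency(flips) - target)


def vrs_is_closer(direct_flips: Sequence[int],
                  vrs_flips: Sequence[int],
                  target: float = 0.5) -> bool:
    """True if the VRS flips land strictly closer to the target frequency."""
    return bias(vrs_flips, target) < bias(direct_flips, target)


# Toy data for illustration only, not results from the paper.
direct = [1] * 7 + [0] * 3    # 70% heads
vrs = [1] * 11 + [0] * 9      # 55% heads
print(vrs_is_closer(direct, vrs))  # prints True
```

With 10,000 flips of a truly fair coin the standard error of the heads frequency is 0.005, so gaps of that size between the two methods should not be read as evidence either way.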

Figures

Figures reproduced from arXiv: 2506.09998 by Bernhard Schölkopf, Johannes Zenn, Robert Bamler, Tim Z. Xiao, Weiyang Liu, Zhen Liu.

Figure 1: Illustrations of the knowledge-sampling gap and two different sampling methods.
Figure 2: Prompt templates for direct sampling and Verbalized Rejection Sampling.
Figure 3: Recognition accuracy matrix. The left panel shows high off-diagonal accuracy for Python generated data (i.e., confidently rejecting incorrect hypotheses), with minor errors along the diagonal due to natural sample variation (e.g., 48 ones out of 100 for p = 0.5 may lead to confusion with p = 0.48, hence, rejecting the correct hypotheses). In contrast, the right panel shows major degradation for LLM-genera…
Figure 4: Calibration plots for direct sampling and the four different phrasings.
Figure 5: Calibration plots and STVD trend for various reasoning length constraints.
Figure 6: STVD vs CoT Length.
Figure 8: Empirical acceptance rates for VRS.
Figure 7: Calibration plot for VRS. We evaluate VRS on four different LLMs: Llama-3.1, GPT-4.1-nano, DeepSeekV3, and Qwen-2.5. For each model, we run VRS until it accepts 100 samples for each of the 101 values of p ∈ [0.0, 1.0], following the same setup as in the direct sampling experiments. As the proposal distribution Q, we fix it to a uniform Bernoulli with q = 0.5 across all values of p. The resulting calibratio…
Figure 9: Calibration plots for two ablations and an example LLM output for VRS-simple.
Figure 10: Calibration plots with error bounds ±c overlaid. In our experiments we fix the proposal to q = 0.5. This allows us, for each p, to compute the constants M and c (i.e., the upper bound on |e(x)|) under which VRS outperforms direct sampling.
Figure 11: Calibration for various reasoning length constraints in direct sampling: GPT-4.1-nano
Figure 12: Calibration of direct sampling for GPT-4.1-nano (top left), Qwen-2.5
Figure 13: Calibration of VRS for GPT-4.1-nano (top left), Qwen-2.5
Figure 14: Calibration plot for VRS-M, an ablation of VRS that provides the model with a description of how M is computed. The corresponding STVD scores can be found in …
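
The evaluation protocol quoted in the Figure 7 caption (101 values of p, 100 accepted samples each, proposal q = 0.5) can be mimicked with an idealized sampler to see what a well-calibrated run would look like. This sketch replaces the LLM with Python's pseudo-random generator, so it shows only the sampling-noise floor and the ideal acceptance rate, not the paper's results. The summed |p − p̂| is the total variation distance between Bernoulli(p) and Bernoulli(p̂) added over the grid; treating that sum as the paper's STVD is an assumption, and all function names are illustrative.

```python
import random


def run_until_accepted(p, q, n_accept, rng):
    """Rejection-sample Bernoulli(p) from Bernoulli(q).

    Returns the accepted samples and the total number of proposals made.
    """
    m = max(p / q, (1 - p) / (1 - q))  # envelope constant M
    accepted, proposals = [], 0
    while len(accepted) < n_accept:
        x = 1 if rng.random() < q else 0
        proposals += 1
        target_mass = p if x == 1 else 1 - p
        proposal_mass = q if x == 1 else 1 - q
        if rng.random() < target_mass / (m * proposal_mass):
            accepted.append(x)
    return accepted, proposals


def calibration_sweep(n_accept=100, q=0.5, seed=0):
    """Return (p, empirical frequency, acceptance rate) for p = 0.00 .. 1.00."""
    rng = random.Random(seed)
    rows = []
    for i in range(101):
        p = i / 100
        accepted, proposals = run_until_accepted(p, q, n_accept, rng)
        rows.append((p, sum(accepted) / n_accept, n_accept / proposals))
    return rows


rows = calibration_sweep()
summed_tvd = sum(abs(p - p_hat) for p, p_hat, _ in rows)
print(f"summed TVD over the grid: {summed_tvd:.3f}")
```

Each row corresponds to one point on a calibration plot (p against p̂) and one point on an acceptance-rate plot, which is how the output lines up with the quantities named in the Figure 7 and Figure 8 captions.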
Original abstract

Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling in which an LLM is prompted to reason about and accept or reject proposed Bernoulli samples. The central claims are that VRS substantially reduces sampling bias relative to direct prompting across multiple models, that this improvement holds under mild assumptions with contributions from both the algorithmic structure and the prompt design, and that the approach demonstrates how classical probabilistic tools can be verbalized to improve LLM reliability without internal access or heavy engineering.

Significance. If the theoretical analysis and empirical results hold, the work provides a concrete, accessible method for eliciting more faithful stochastic outputs from LLMs. This is relevant for Monte Carlo methods, agent simulations, and randomized decision-making where LLMs are already used but suffer from sampling bias. The verbalization of rejection sampling is a clear example of embedding established probabilistic techniques into LLM workflows, and the separation of algorithmic versus prompt contributions offers a useful lens for future prompt-based sampling research.

major comments (2)
  1. [§3] §3 (theoretical analysis): the claim that VRS improves over direct sampling 'under mild assumptions' requires an explicit statement of those assumptions together with a short derivation or proof sketch showing that the acceptance probability corrects the LLM's internal bias without introducing new free parameters; the current description leaves open whether the improvement is guaranteed or merely plausible.
  2. [§4] §4 (empirical evaluation): the reported bias reductions should be accompanied by per-model sample counts, exact prompt templates, and a control that isolates the rejection step from prompt wording changes; without these, it is difficult to attribute gains to the algorithm versus prompt design as asserted.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from a one-sentence comparison to related work on LLM sampling bias (e.g., temperature scaling or logit bias methods) to clarify novelty.
  2. [Preliminaries] Notation for the internal Bernoulli parameter p and the target distribution should be introduced once and used consistently; currently the distinction is clear in prose but could be formalized in a small table or equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address each major comment below with clarifications and planned revisions to improve clarity and reproducibility.

point-by-point responses
  1. Referee: [§3] §3 (theoretical analysis): the claim that VRS improves over direct sampling 'under mild assumptions' requires an explicit statement of those assumptions together with a short derivation or proof sketch showing that the acceptance probability corrects the LLM's internal bias without introducing new free parameters; the current description leaves open whether the improvement is guaranteed or merely plausible.

    Authors: We agree that greater explicitness in §3 will strengthen the presentation. In the revision we will add a short subsection stating the mild assumptions: (i) the LLM's sampling bias arises from its internal next-token distribution rather than from the verbalized reasoning process, and (ii) the model can use natural-language reasoning to evaluate whether a proposed sample matches the target Bernoulli parameter. We will include a concise derivation showing that the verbalized acceptance step reweights the output distribution toward the unbiased target without introducing any new free parameters. Under these assumptions the improvement is guaranteed, not merely plausible. revision: yes

  2. Referee: [§4] §4 (empirical evaluation): the reported bias reductions should be accompanied by per-model sample counts, exact prompt templates, and a control that isolates the rejection step from prompt wording changes; without these, it is difficult to attribute gains to the algorithm versus prompt design as asserted.

    Authors: We thank the referee for this suggestion. The revised manuscript will include a table listing the exact number of samples drawn per model and per method. All prompt templates will be reproduced verbatim in a new appendix. We will also add a control condition that employs the identical verbalized reasoning prompt but omits the rejection step, so that the model directly outputs the reasoned sample. This isolates the algorithmic contribution of rejection sampling from prompt-design effects and supports the claim that both components contribute to the observed bias reduction. revision: yes
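
For orientation, the classical identity that any such derivation would start from is the exactness of rejection sampling when the acceptance step is carried out without error. The following is the textbook argument specialized to a Bernoulli target P with parameter p and a Bernoulli proposal Q with parameter q. It is not the paper's proof, and it does not model the acceptance error e(x) mentioned in the Figure 10 caption; the symbol A(x) for the acceptance probability is introduced here for illustration.

```latex
\begin{align*}
M &= \max\!\left(\frac{p}{q},\ \frac{1-p}{1-q}\right), \qquad
A(x) = \frac{P(x)}{M\,Q(x)} \in [0, 1], \\
\Pr[\text{accept}] &= \sum_{x \in \{0,1\}} Q(x)\,A(x)
  = \sum_{x \in \{0,1\}} \frac{P(x)}{M} = \frac{1}{M}, \\
\Pr[x \mid \text{accept}] &= \frac{Q(x)\,A(x)}{\Pr[\text{accept}]}
  = \frac{P(x)/M}{1/M} = P(x).
\end{align*}
```

With q = 0.5 this gives M = 2 max(p, 1 − p), so the ideal acceptance rate 1/M ranges from 1 at p = 0.5 down to 0.5 at p = 0 and p = 1. Any guarantee for VRS has to account for how far the model's verbalized acceptance decision departs from A(x).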

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under stated assumptions

full rationale

The paper adapts classical rejection sampling to a verbalized LLM prompt (VRS) and provides a theoretical analysis showing improvement over direct sampling under mild assumptions. No load-bearing step reduces by construction to fitted parameters, self-referential definitions, or a self-citation chain. The abstract explicitly separates algorithmic gains from prompt design and grounds the claim in standard probabilistic properties rather than internal LLM mechanisms or prior author results. The reader's assessment of independence aligns with the absence of any quoted equation or premise that equates the output to its inputs. This is the expected honest non-finding for a paper whose central claim rests on an external classical tool adapted to a new domain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the work relies on standard probabilistic rejection sampling and LLM prompting capabilities without introducing new free parameters or entities.

axioms (1)
  • domain assumption: Mild assumptions under which VRS improves over direct sampling.
    Referenced in the abstract as the basis for the theoretical analysis.


