Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling
Pith reviewed 2026-05-19 09:20 UTC · model grok-4.3
The pith
Verbalized Rejection Sampling reduces bias in LLM coin flips by prompting models to accept or reject their own proposals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Verbalized Rejection Sampling adapts classical rejection sampling to natural-language prompts so that an LLM proposes Bernoulli samples and then evaluates whether to accept or reject them based on its own description of the target distribution. Despite using the identical internal sampling process as direct prompting, the verbalized version produces outputs whose empirical frequencies are closer to the intended probabilities across multiple models. The paper supplies a theoretical argument that, under mild assumptions on the model’s ability to describe distributions, the method yields lower bias than direct sampling, with separate contributions traceable to the rejection algorithm and to the prompt design.
What carries the argument
Verbalized Rejection Sampling (VRS), which prompts the model to generate a candidate sample and then reason step-by-step about accepting or rejecting it according to the target Bernoulli probability.
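The propose/accept loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `propose`, `llm_accepts`, and `make_stub_acceptor` are hypothetical names, the source of proposals is left abstract, and the stub applies the exact classical acceptance rule in place of a real LLM call, so it shows only the control flow.

```python
import random
from typing import Callable


def verbalized_rejection_sample(
    target_p: float,
    propose: Callable[[], int],
    llm_accepts: Callable[[float, int], bool],
    max_tries: int = 1000,
) -> int:
    """Return one sample (1 = heads, 0 = tails) from a propose/accept loop."""
    for _ in range(max_tries):
        candidate = propose()
        # In VRS this decision would come from prompting the model to reason
        # about the candidate; here it is an injected callable (assumption).
        if llm_accepts(target_p, candidate):
            return candidate
    raise RuntimeError("no proposal accepted within max_tries")


def make_stub_acceptor(rng: random.Random, proposal_p: float = 0.5):
    """Hypothetical stand-in for the LLM: the exact classical acceptance rule."""

    def accepts(target_p: float, candidate: int) -> bool:
        p_x = target_p if candidate == 1 else 1.0 - target_p
        q_x = proposal_p if candidate == 1 else 1.0 - proposal_p
        # Envelope constant: the largest ratio of target to proposal mass.
        m = max(target_p / proposal_p, (1.0 - target_p) / (1.0 - proposal_p))
        return rng.random() < p_x / (m * q_x)

    return accepts


rng = random.Random(0)
propose = lambda: 1 if rng.random() < 0.5 else 0  # fair-coin proposal
accepts = make_stub_acceptor(rng)
flips = [verbalized_rejection_sample(0.3, propose, accepts) for _ in range(20000)]
heads_rate = sum(flips) / len(flips)
print(f"empirical heads rate: {heads_rate:.3f} (target 0.3)")
```

Keeping the acceptor as an injected callable means the same loop could be run against a real model by swapping in a function that sends the prompt and parses an accept or reject answer.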
If this is right
- Monte Carlo estimates and agent simulations that rely on LLM randomness become more trustworthy.
- The same verbal rejection pattern can be applied to other discrete distributions without changing model internals.
- Performance gains arise separately from the algorithmic structure and from the style of the prompt.
- No white-box access or fine-tuning is required to obtain the bias reduction.
Where Pith is reading between the lines
- The technique may extend to continuous distributions if the model can verbalize acceptance regions.
- Combining VRS with temperature tuning or few-shot examples could produce further bias reductions.
- The method provides a practical testbed for studying how explicit reasoning affects implicit sampling behavior in LLMs.
Load-bearing premise
The model can be prompted to give a coherent verbal description of the desired probability and to use that description when deciding whether to keep or discard a proposed sample.
What would settle it
Measure the empirical frequency of heads when several LLMs each generate 10,000 coin flips with direct prompting versus with VRS; the claim is falsified if VRS does not produce frequencies closer to the target probability (0.5 for a fair coin).
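The comparison can be scored with a few lines of code. The sketch below is illustrative only: `bernoulli_bias_report` and `vrs_is_closer` are hypothetical helper names, and the sample lists are made-up placeholders rather than results from the paper. It relies on the fact that the total variation distance between two Bernoulli distributions equals the absolute difference of their head probabilities.

```python
import math
from typing import Dict, Sequence


def bernoulli_bias_report(flips: Sequence[int], target_p: float) -> Dict[str, float]:
    """Summarize how far the empirical heads frequency is from target_p."""
    n = len(flips)
    if n == 0:
        raise ValueError("need at least one flip")
    heads_freq = sum(flips) / n
    return {
        "n": n,
        "heads_freq": heads_freq,
        # TV distance between Bernoulli(heads_freq) and Bernoulli(target_p).
        "tv_distance": abs(heads_freq - target_p),
        # Sampling noise expected from an unbiased sampler at this n.
        "std_error": math.sqrt(target_p * (1.0 - target_p) / n),
    }


def vrs_is_closer(direct: Sequence[int], vrs: Sequence[int], target_p: float) -> bool:
    """True if the VRS flips land closer to the target than the direct flips."""
    return (
        bernoulli_bias_report(vrs, target_p)["tv_distance"]
        < bernoulli_bias_report(direct, target_p)["tv_distance"]
    )


# Placeholder data for illustration only (not measurements).
direct_flips = [1] * 7 + [0] * 3
vrs_flips = [1] * 5 + [0] * 5
report = bernoulli_bias_report(direct_flips, target_p=0.5)
print(report, vrs_is_closer(direct_flips, vrs_flips, target_p=0.5))
```

Reporting `std_error` next to the distance helps separate a real bias from ordinary sampling noise at the chosen sample size.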
Original abstract
Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
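For readers unfamiliar with the classical tool the abstract refers to, the textbook argument for why exact rejection sampling recovers a Bernoulli target is short. The notation here is an assumption chosen for illustration and may not match the paper's: P is the target distribution on {0, 1}, Q is a proposal with Q(x) > 0 for both outcomes, M is an envelope constant, and A(x) is the acceptance probability.

```latex
\begin{align*}
A(x) &= \frac{P(x)}{M\,Q(x)}, \qquad M \ge \max_{x \in \{0,1\}} \frac{P(x)}{Q(x)}, \\
\Pr(\text{accept}) &= \sum_{x \in \{0,1\}} Q(x)\,A(x)
  = \frac{1}{M} \sum_{x \in \{0,1\}} P(x) = \frac{1}{M}, \\
\Pr(X = x \mid \text{accept}) &= \frac{Q(x)\,A(x)}{\Pr(\text{accept})}
  = \frac{P(x)/M}{1/M} = P(x).
\end{align*}
```

This exactness holds only when the acceptance step is carried out with exactly the probability A(x). The paper's analysis concerns the case where the LLM carries out that step imperfectly, which this sketch does not cover.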
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling in which an LLM is prompted to reason about and accept or reject proposed Bernoulli samples. The central claims are that VRS substantially reduces sampling bias relative to direct prompting across multiple models, that this improvement holds under mild assumptions with contributions from both the algorithmic structure and the prompt design, and that the approach demonstrates how classical probabilistic tools can be verbalized to improve LLM reliability without internal access or heavy engineering.
Significance. If the theoretical analysis and empirical results hold, the work provides a concrete, accessible method for eliciting more faithful stochastic outputs from LLMs. This is relevant for Monte Carlo methods, agent simulations, and randomized decision-making where LLMs are already used but suffer from sampling bias. The verbalization of rejection sampling is a clear example of embedding established probabilistic techniques into LLM workflows, and the separation of algorithmic versus prompt contributions offers a useful lens for future prompt-based sampling research.
major comments (2)
- [§3] §3 (theoretical analysis): the claim that VRS improves over direct sampling 'under mild assumptions' requires an explicit statement of those assumptions together with a short derivation or proof sketch showing that the acceptance probability corrects the LLM's internal bias without introducing new free parameters; the current description leaves open whether the improvement is guaranteed or merely plausible.
- [§4] §4 (empirical evaluation): the reported bias reductions should be accompanied by per-model sample counts, exact prompt templates, and a control that isolates the rejection step from prompt wording changes; without these, it is difficult to attribute gains to the algorithm versus prompt design as asserted.
minor comments (2)
- [Introduction] The abstract and introduction would benefit from a one-sentence comparison to related work on LLM sampling bias (e.g., temperature scaling or logit bias methods) to clarify novelty.
- [Preliminaries] Notation for the internal Bernoulli parameter p and the target distribution should be introduced once and used consistently; currently the distinction is clear in prose but could be formalized in a small table or equation.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We address each major comment below with clarifications and planned revisions to improve clarity and reproducibility.
- Referee: [§3] §3 (theoretical analysis): the claim that VRS improves over direct sampling 'under mild assumptions' requires an explicit statement of those assumptions together with a short derivation or proof sketch showing that the acceptance probability corrects the LLM's internal bias without introducing new free parameters; the current description leaves open whether the improvement is guaranteed or merely plausible.
  Authors: We agree that greater explicitness in §3 will strengthen the presentation. In the revision we will add a short subsection stating the mild assumptions: (i) the LLM's sampling bias arises from its internal next-token distribution rather than from the verbalized reasoning process, and (ii) the model can use natural-language reasoning to evaluate whether a proposed sample matches the target Bernoulli parameter. We will include a concise derivation showing that the verbalized acceptance step reweights the output distribution toward the unbiased target without introducing any new free parameters. Under these assumptions the improvement is guaranteed, not merely plausible.
  Revision: yes
- Referee: [§4] §4 (empirical evaluation): the reported bias reductions should be accompanied by per-model sample counts, exact prompt templates, and a control that isolates the rejection step from prompt wording changes; without these, it is difficult to attribute gains to the algorithm versus prompt design as asserted.
  Authors: We thank the referee for this suggestion. The revised manuscript will include a table listing the exact number of samples drawn per model and per method. All prompt templates will be reproduced verbatim in a new appendix. We will also add a control condition that employs the identical verbalized reasoning prompt but omits the rejection step, so that the model directly outputs the reasoned sample. This isolates the algorithmic contribution of rejection sampling from prompt-design effects and supports the claim that both components contribute to the observed bias reduction.
  Revision: yes
Circularity Check
No significant circularity; derivation is self-contained under stated assumptions
The paper adapts classical rejection sampling to a verbalized LLM prompt (VRS) and provides a theoretical analysis showing improvement over direct sampling under mild assumptions. No load-bearing step reduces by construction to fitted parameters, self-referential definitions, or a self-citation chain. The abstract explicitly separates algorithmic gains from prompt design and grounds the claim in standard probabilistic properties rather than internal LLM mechanisms or prior author results. The reader's assessment of independence aligns with the absence of any quoted equation or premise that equates the output to its inputs. This is the expected honest non-finding for a paper whose central claim rests on an external classical tool adapted to a new domain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mild assumptions under which VRS improves over direct sampling
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. ... TV(P̃, P) ≤ M c / (1 − M c)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Proposition 1 ... TV(P̃, P) ≤ M c / (1 − M c)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.