Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

Bernhard Schölkopf; Johannes Zenn; Robert Bamler; Tim Z. Xiao; Weiyang Liu; Zhen Liu

arXiv: 2506.09998 · v2 · submitted 2025-06-11 · cs.LG · cs.CL

Tim Z. Xiao, Johannes Zenn, Zhen Liu, Weiyang Liu, Robert Bamler, Bernhard Schölkopf

Pith reviewed 2026-05-19 09:20 UTC · model grok-4.3

keywords: large language models · sampling bias · rejection sampling · Bernoulli distribution · prompt engineering · stochastic generation · Monte Carlo methods

The pith

Verbalized Rejection Sampling reduces bias in LLM coin flips by prompting models to accept or reject their own proposals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can describe probability distributions in words but often produce biased samples when asked to generate them directly. The paper introduces Verbalized Rejection Sampling, a method that asks the model to propose a sample and then reason about whether to keep or discard it. This approach cuts sampling bias on Bernoulli tasks such as fair coin flips even though the underlying random mechanism stays the same. The improvement comes from both the rejection step and the way the prompt guides the model’s reasoning. If the method works as claimed, it offers a lightweight way to make LLM-driven simulations and randomized decisions more reliable without inspecting model weights.

Core claim

Verbalized Rejection Sampling adapts classical rejection sampling to natural-language prompts so that an LLM proposes Bernoulli samples and then evaluates whether to accept or reject them based on its own description of the target distribution. Despite using the identical internal sampling process as direct prompting, the verbalized version produces outputs whose empirical frequencies are closer to the intended probabilities across multiple models. The paper supplies a theoretical argument that, under mild assumptions on the model’s ability to describe distributions, the method yields lower bias than direct sampling, with separate contributions traceable to the rejection algorithm and to the

What carries the argument

Verbalized Rejection Sampling (VRS), which prompts the model to generate a candidate sample and then reason step-by-step about accepting or rejecting it according to the target Bernoulli probability.
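For reference, the classical procedure that VRS verbalizes can be sketched in a few lines. This is a minimal Python sketch, not the paper's prompt or code: a pseudorandom generator stands in for the LLM, all function names are hypothetical, and the acceptance rule A(x) = P(x) / (M Q(x)) with M = max_x P(x)/Q(x) is the textbook one, assumed here rather than quoted from the paper.

```python
import random
from typing import Optional


def envelope_constant(p: float, q: float = 0.5) -> float:
    """M = max over x in {0, 1} of P(x) / Q(x) for target Bernoulli(p), proposal Bernoulli(q)."""
    return max(p / q, (1.0 - p) / (1.0 - q))


def acceptance_probability(x: int, p: float, q: float = 0.5) -> float:
    """Textbook acceptance rule A(x) = P(x) / (M * Q(x)); an assumption, not the paper's prompt."""
    m = envelope_constant(p, q)
    target = p if x == 1 else 1.0 - p
    proposal = q if x == 1 else 1.0 - q
    return target / (m * proposal)


def rejection_sample(p: float, q: float = 0.5, rng: Optional[random.Random] = None) -> int:
    """Draw one sample from Bernoulli(p) using proposals from Bernoulli(q)."""
    rng = rng or random.Random()
    while True:
        x = 1 if rng.random() < q else 0  # propose (the LLM's role in VRS)
        if rng.random() < acceptance_probability(x, p, q):  # accept or reject
            return x


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [rejection_sample(0.7, rng=rng) for _ in range(10_000)]
    print(sum(draws) / len(draws))  # empirical frequency of ones
```

In VRS, as described above, the model performs the accept-or-reject step in natural language; the sketch only shows the ideal behavior that step is meant to approximate.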

If this is right

Monte Carlo estimates and agent simulations that rely on LLM randomness become more trustworthy.
The same verbal rejection pattern can be applied to other discrete distributions without changing model internals.
Performance gains arise separately from the algorithmic structure and from the style of the prompt.
No white-box access or fine-tuning is required to obtain the bias reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The technique may extend to continuous distributions if the model can verbalize acceptance regions.
Combining VRS with temperature tuning or few-shot examples could produce further bias reductions.
The method provides a practical testbed for studying how explicit reasoning affects implicit sampling behavior in LLMs.

Load-bearing premise

The model can be prompted to give a coherent verbal description of the desired probability and to use that description when deciding whether to keep or discard a proposed sample.

What would settle it

Measure the empirical frequency of heads when many LLMs generate 10,000 coin flips with direct prompting versus with VRS; the claim is falsified if VRS does not produce frequencies closer to 0.5.
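The comparison described above comes down to a few lines of bookkeeping. The sketch below is illustrative only: the helper names are hypothetical, outcomes are assumed to be recorded as "H" or "T" strings, and the two short sequences in the usage section are made-up placeholders rather than measurements from any model.

```python
from typing import Sequence


def heads_frequency(flips: Sequence[str]) -> float:
    """Fraction of 'H' outcomes in a sequence of 'H'/'T' strings."""
    if not flips:
        raise ValueError("need at least one flip")
    return sum(1 for f in flips if f == "H") / len(flips)


def bias(flips: Sequence[str], target: float = 0.5) -> float:
    """Absolute gap between the empirical heads frequency and the target probability."""
    return abs(heads_frequency(flips) - target)


def vrs_is_closer(direct: Sequence[str], vrs: Sequence[str], target: float = 0.5) -> bool:
    """True if the VRS flips land strictly closer to the target than the direct flips."""
    return bias(vrs, target) < bias(direct, target)


if __name__ == "__main__":
    # Placeholder data for illustration only; not results from the paper.
    direct_flips = ["H"] * 7 + ["T"] * 3
    vrs_flips = ["H"] * 5 + ["T"] * 5
    print(bias(direct_flips), bias(vrs_flips), vrs_is_closer(direct_flips, vrs_flips))
```

A real check would also account for sampling noise: for a fair coin and 10,000 flips, the standard error of the heads frequency is sqrt(0.25 / 10000) = 0.005, so gaps of that size are not evidence either way.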

Figures

Figures reproduced from arXiv: 2506.09998 by Bernhard Schölkopf, Johannes Zenn, Robert Bamler, Tim Z. Xiao, Weiyang Liu, Zhen Liu.

**Figure 2.** Prompt templates for direct sampling and Verbalized Rejection Sampling.

**Figure 3.** Recognition accuracy matrix. The left panel shows high off-diagonal accuracy for Python generated data (i.e., confidently rejecting incorrect hypotheses), with minor errors along the diagonal due to natural sample variation (e.g., 48 ones out of 100 for p = 0.5 may lead to confusion with p = 0.48, hence, rejecting the correct hypotheses). In contrast, the right panel shows major degradation for LLM-genera…

**Figure 4.** Calibration plots for direct sampling and the four different phrasings.

**Figure 5.** Calibration plots and STVD trend for various reasoning length constraints.

**Figure 6.** STVD vs CoT Length.

**Figure 7.** Calibration plot for VRS. We evaluate VRS on four different LLMs: Llama-3.1, GPT-4.1-nano, DeepSeekV3, and Qwen-2.5. For each model, we run VRS until it accepts 100 samples for each of the 101 values of p ∈ [0.0, 1.0], following the same setup as in the direct sampling experiments. As the proposal distribution Q, we fix it to a uniform Bernoulli with q = 0.5 across all values of p. The resulting calibratio…

**Figure 8.** Empirical acceptance rates for VRS.

**Figure 9.** Calibration plots for two ablations and an example LLM output for VRS-simple.

**Figure 10.** Calibration plots with error bounds ±c overlaid. In our experiments we fix the proposal to q = 0.5. This allows us, for each p, to compute the constants M and c (i.e., the upper bound on |e(x)|) under which VRS outperforms direct sampling.

**Figure 11.** Calibration for various reasoning length constraints in direct sampling: GPT-4.1-nano.

**Figure 12.** Calibration of direct sampling for GPT-4.1-nano (top left), Qwen-2.5 …

**Figure 13.** Calibration of VRS for GPT-4.1-nano (top left), Qwen-2.5 …

**Figure 14.** Calibration plot for VRS-M, an ablation of VRS that provides the model with a description of how M is computed. The corresponding STVD scores can be found in …
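The setup quoted in the Figure 7 caption (proposal fixed at q = 0.5, 101 values of p, 100 accepted samples per value) implies a reference curve for the acceptance rates. The Python sketch below computes it under two assumptions that are not stated on this page: that the 101 values of p are evenly spaced in steps of 0.01, and that an ideal sampler accepts with the textbook overall rate 1/M, where M = max_x P(x)/Q(x). Function names are hypothetical.

```python
def envelope_constant(p: float, q: float = 0.5) -> float:
    """M = max over x in {0, 1} of P(x) / Q(x) for target Bernoulli(p), proposal Bernoulli(q)."""
    return max(p / q, (1.0 - p) / (1.0 - q))


def ideal_acceptance_rate(p: float, q: float = 0.5) -> float:
    """Overall acceptance probability of exact rejection sampling, which equals 1 / M."""
    return 1.0 / envelope_constant(p, q)


def expected_proposals(n_accepted: int, p: float, q: float = 0.5) -> float:
    """Expected number of proposals needed to collect n_accepted samples."""
    return n_accepted * envelope_constant(p, q)


# Assumed grid: 101 evenly spaced values 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
rates = {p: ideal_acceptance_rate(p) for p in grid}

if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, rates[p], expected_proposals(100, p))
```

With q = 0.5 the ideal rate is 1 at p = 0.5 and falls to 0.5 at p = 0 and p = 1, so an exact sampler never needs more than 200 proposals on average to collect 100 accepted samples.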

Original abstract

Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VRS turns rejection sampling into a prompt that cuts LLM bias on coin flips, with theory under mild assumptions and cross-model experiments showing the gain.

The core point is that this paper takes classical rejection sampling and puts it into natural-language prompts so the LLM reasons about accepting or rejecting proposed Bernoulli samples. The result is lower bias than direct prompting, even though the underlying coin-flip mechanism stays the same. That is the main new piece: a verbalized version of an old algorithm that works without touching model weights or doing heavy prompt engineering. The authors also give a short theoretical argument that the method improves on direct sampling under mild assumptions, and they report empirical bias reductions across several models. Both the algorithmic structure and the prompt wording appear to contribute, which is a useful distinction to make.

The work is relevant for anyone trying to run Monte Carlo loops or randomized decisions inside LLM agents, since faithful sampling has been a practical blocker. The approach is simple enough that others could try it quickly.

On the softer side, the theoretical assumptions are described as mild but would need checking against how current LLMs actually behave when asked to reason about acceptance probabilities. The experiments claim clear gains, yet the abstract leaves open how much of the improvement is from the rejection step versus from simply giving the model more structured text to follow. Without the full derivations and raw data it is hard to judge reproducibility or whether prompt variations were fully controlled. The citation pattern looks standard and does not rely on self-reference for the main claims.

This paper is for people working on reliable stochastic behavior in language models rather than for the general LLM audience. A reader who cares about prompt-based fixes for probabilistic tasks would get concrete value from the method and the analysis. It has enough substance and testability to deserve a serious referee, even if the details need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling in which an LLM is prompted to reason about and accept or reject proposed Bernoulli samples. The central claims are that VRS substantially reduces sampling bias relative to direct prompting across multiple models, that this improvement holds under mild assumptions with contributions from both the algorithmic structure and the prompt design, and that the approach demonstrates how classical probabilistic tools can be verbalized to improve LLM reliability without internal access or heavy engineering.

Significance. If the theoretical analysis and empirical results hold, the work provides a concrete, accessible method for eliciting more faithful stochastic outputs from LLMs. This is relevant for Monte Carlo methods, agent simulations, and randomized decision-making where LLMs are already used but suffer from sampling bias. The verbalization of rejection sampling is a clear example of embedding established probabilistic techniques into LLM workflows, and the separation of algorithmic versus prompt contributions offers a useful lens for future prompt-based sampling research.

major comments (2)

[§3] §3 (theoretical analysis): the claim that VRS improves over direct sampling 'under mild assumptions' requires an explicit statement of those assumptions together with a short derivation or proof sketch showing that the acceptance probability corrects the LLM's internal bias without introducing new free parameters; the current description leaves open whether the improvement is guaranteed or merely plausible.
[§4] §4 (empirical evaluation): the reported bias reductions should be accompanied by per-model sample counts, exact prompt templates, and a control that isolates the rejection step from prompt wording changes; without these, it is difficult to attribute gains to the algorithm versus prompt design as asserted.

minor comments (2)

[Introduction] The abstract and introduction would benefit from a one-sentence comparison to related work on LLM sampling bias (e.g., temperature scaling or logit bias methods) to clarify novelty.
[Preliminaries] Notation for the internal Bernoulli parameter p and the target distribution should be introduced once and used consistently; currently the distinction is clear in prose but could be formalized in a small table or equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address each major comment below with clarifications and planned revisions to improve clarity and reproducibility.

Referee: [§3] §3 (theoretical analysis): the claim that VRS improves over direct sampling 'under mild assumptions' requires an explicit statement of those assumptions together with a short derivation or proof sketch showing that the acceptance probability corrects the LLM's internal bias without introducing new free parameters; the current description leaves open whether the improvement is guaranteed or merely plausible.

Authors: We agree that greater explicitness in §3 will strengthen the presentation. In the revision we will add a short subsection stating the mild assumptions: (i) the LLM's sampling bias arises from its internal next-token distribution rather than from the verbalized reasoning process, and (ii) the model can use natural-language reasoning to evaluate whether a proposed sample matches the target Bernoulli parameter. We will include a concise derivation showing that the verbalized acceptance step reweights the output distribution toward the unbiased target without introducing any new free parameters. Under these assumptions the improvement is guaranteed, not merely plausible.
Revision: yes

Referee: [§4] §4 (empirical evaluation): the reported bias reductions should be accompanied by per-model sample counts, exact prompt templates, and a control that isolates the rejection step from prompt wording changes; without these, it is difficult to attribute gains to the algorithm versus prompt design as asserted.

Authors: We thank the referee for this suggestion. The revised manuscript will include a table listing the exact number of samples drawn per model and per method. All prompt templates will be reproduced verbatim in a new appendix. We will also add a control condition that employs the identical verbalized reasoning prompt but omits the rejection step, so that the model directly outputs the reasoned sample. This isolates the algorithmic contribution of rejection sampling from prompt-design effects and supports the claim that both components contribute to the observed bias reduction.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under stated assumptions

The paper adapts classical rejection sampling to a verbalized LLM prompt (VRS) and provides a theoretical analysis showing improvement over direct sampling under mild assumptions. No load-bearing step reduces by construction to fitted parameters, self-referential definitions, or a self-citation chain. The abstract explicitly separates algorithmic gains from prompt design and grounds the claim in standard probabilistic properties rather than internal LLM mechanisms or prior author results. The reader's assessment of independence aligns with the absence of any quoted equation or premise that equates the output to its inputs. This is the expected honest non-finding for a paper whose central claim rests on an external classical tool adapted to a new domain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the work relies on standard probabilistic rejection sampling and LLM prompting capabilities without introducing new free parameters or entities.

axioms (1)

domain assumption: Mild assumptions under which VRS improves over direct sampling
Referenced in the abstract as the basis for the theoretical analysis.

pith-pipeline@v0.9.0 · 5705 in / 1060 out tokens · 38075 ms · 2026-05-19T09:20:59.627222+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. ... $\mathrm{TV}(\tilde{P}, P) \le \frac{Mc}{1 - Mc}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Proposition 1 ... $\mathrm{TV}(\tilde{P}, P) \le \frac{Mc}{1 - Mc}$
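To get a feel for the bound quoted in the passages above, it can be evaluated numerically. The Python sketch below assumes that M is the usual rejection-sampling envelope constant max_x P(x)/Q(x) and that c is the upper bound on |e(x)| mentioned in the Figure 10 caption; the page does not define e(x), so that reading is an assumption. The requirement 0 ≤ Mc < 1 is added here only to keep the expression meaningful, and the function name is hypothetical.

```python
def tv_upper_bound(m: float, c: float) -> float:
    """Evaluate M*c / (1 - M*c), the quoted upper bound on TV(P~, P).

    Assumes 0 <= M*c < 1; outside that range the expression is not a usable bound.
    """
    mc = m * c
    if not 0.0 <= mc < 1.0:
        raise ValueError("bound requires 0 <= M*c < 1")
    return mc / (1.0 - mc)


if __name__ == "__main__":
    # Illustrative values only: p = 0.7 with proposal q = 0.5 gives M = 0.7 / 0.5 = 1.4.
    m = 0.7 / 0.5
    for c in (0.0, 0.05, 0.1, 0.2):
        print(c, tv_upper_bound(m, c))
```

The bound is zero when c = 0 and grows quickly as Mc approaches 1, so a larger envelope constant M makes the same acceptance error c more costly.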

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 3 internal anchors

[1] Maya Bar-Hillel, Eyal Peer, and Alessandro Acquisti. “Heads or tails?”—a reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6):1656, 2014.
[2] Yongqiang Cai. Vocabulary for universal approximation: A linguistic perspective of mapping compositions. arXiv preprint arXiv:2305.12205, 2023.
[3] Yong Cao, Haijiang Liu, Arnav Arora, Isabelle Augenstein, Paul Röttger, and Daniel Hershcovich. Specializing large language models to simulate survey response distributions for global populations. arXiv preprint arXiv:2502.07068, 2025.
[4] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76:1–32, 2017.
[5] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[6] Jia Gu, Liang Pang, Huawei Shen, and Xueqi Cheng. Do LLMs play dice? Exploring probability distribution sampling in large language models for behavioral simulation. arXiv preprint arXiv:2404.09043, 2024.
[7] Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, and David M Chan. Enough coin flips can make LLMs act Bayesian. arXiv preprint arXiv:2503.04722, 2025.
[8] Aspen K Hopkins and Alex Renda. Can LLMs generate random numbers? Evaluating LLM sampling in controlled domains. In Sampling and Optimization in Discrete Space (SODS) ICML 2023 Workshop, 2023.
[9] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[10] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[11] Nicole Meister, Carlos Guestrin, and Tatsunori Hashimoto. Benchmarking distributional alignment of large language models. arXiv preprint arXiv:2411.05403, 2024.
[12] James Requeima, John Bronskill, Dami Choi, Richard Turner, and David K Duvenaud. LLM processes: Numerical predictive distributions conditioned on natural language. In NeurIPS.
[13] Christian P Robert and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
[14] Katherine Van Koevering and Jon Kleinberg. How random is random? Evaluating the randomness and humaness of LLMs’ coin flips. arXiv preprint arXiv:2406.00092, 2024.
[15] Alicia Vidler and Toby Walsh. Evaluating binary decision biases in large language models: Implications for fair agent-based financial simulations. arXiv preprint arXiv:2501.16356, 2025.
[16] Abraham Wald. Statistical decision functions. The Annals of Mathematical Statistics, pages 165–205, 1949.
[17] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
[18] Tim Z Xiao, Robert Bamler, Bernhard Schölkopf, and Weiyang Liu. Verbalized machine learning: Revisiting machine learning with language models. Transactions on Machine Learning Research, 2025.
[19] Yongjian Xu, Akash Nandi, and Evangelos Markopoulos. Application of large language models in stochastic sampling algorithms for predictive modeling of population behavior. Artificial Intelligence and Social Computing, 122:10–20, 2024.
[20] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
