Recognition: no theorem link
Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3
The pith
Large language models cannot reliably generate random samples from statistical distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier large language models lack a functional internal sampler for probability distributions. This is shown by a median 7 percent pass rate under batch generation of N=1000 samples and by near-total failure (10 of 11 models passing zero distributions) under independent requests. Sampling fidelity declines monotonically with distributional complexity and with increasing N. The same deficiencies produce systematic biases in downstream applications, including non-uniform answer-position distributions in multiple-choice question generation and violations of demographic targets in constrained text-to-image prompt synthesis.
What carries the argument
Dual-protocol audit (batch generation versus independent requests) that measures the statistical validity of samples drawn from 15 distributions; a minimal harness sketch follows below.
Load-bearing premise
The chosen statistical tests and pass-rate thresholds accurately measure whether an LLM can serve as a reliable sampler in downstream applications.
What would settle it
Demonstration of a single model that passes every statistical test for all 15 distributions when producing 1000 independent samples per distribution.
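To make the audited machinery concrete, here is a minimal sketch of the two protocols, assuming a generic chat-completion client. The `ask_model` callable, the prompt wording, and the number parsing are illustrative stand-ins, not the paper's actual harness.

```python
import re

N = 1000  # samples per distribution, as in the paper's protocols

def batch_generation(ask_model, dist_description):
    """Batch Generation protocol: one response containing all N samples."""
    prompt = (f"Generate {N} independent samples from {dist_description}. "
              "Return only the numbers, separated by spaces.")
    reply = ask_model(prompt)
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", reply)][:N]

def independent_requests(ask_model, dist_description):
    """Independent Requests protocol: N stateless calls, one sample each."""
    samples = []
    for _ in range(N):
        reply = ask_model(f"Generate one sample from {dist_description}. "
                          "Return only the number.")
        match = re.search(r"-?\d+(?:\.\d+)?", reply)
        if match:
            samples.append(float(match.group()))
    return samples
```

Each returned sample list would then be scored with the goodness-of-fit tests discussed in the referee exchange below.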
Original abstract
As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N{=}1000$ samples within one response, and Independent Requests, comprising $N{=}1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the sampling horizon $N$ increases. Finally, we demonstrate how the propagation of these failures into downstream real-world application tasks introduces systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript audits native probabilistic sampling in 11 frontier LLMs across 15 distributions using a dual-protocol design: batch generation of N=1000 samples in a single response versus 1000 independent stateless requests. It reports a 7% median pass rate for batch mode, near-total collapse (10 of 11 models passing zero distributions) in independent mode, monotonic degradation with distributional complexity and increasing N, and downstream biases such as violated uniform answer-position constraints in MCQ generation and demographic target violations in text-to-image prompt synthesis. The central claim is that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
Significance. If the empirical results hold after addressing methodological gaps, the work is significant for AI systems engineering because it quantifies a practical barrier to deploying LLMs in stochastic pipelines, simulation, or fairness-critical tasks. The protocol asymmetry and downstream propagation examples supply concrete, falsifiable benchmarks that could inform when practitioners must wrap LLMs with external RNGs or samplers rather than relying on native generation.
major comments (3)
- [Dual-protocol design] Dual-protocol design: the central claim that failures demonstrate absence of an internal sampler (rather than instruction-following limits) rests on fixed prompt templates with no reported ablations on phrasing, few-shot exemplars of correct sampling, or explicit randomness instructions. If pass rates exceed 50% under varied prompts while holding temperature and N fixed, the 7% median and near-zero independent-request results would reflect prompt sensitivity instead, weakening the 'necessitating external tools' conclusion.
- [Results] Results and statistical tests: the abstract states concrete pass rates and monotonic trends but provides no explicit description of the exact statistical tests (e.g., Kolmogorov-Smirnov, chi-squared), p-value thresholds, or multiple-comparison corrections used to declare a 'pass.' Without these details the 7% median cannot be evaluated for robustness against post-hoc threshold choices.
- [Downstream applications] Downstream tasks: the reported biases in MCQ position constraints and demographic targets in image prompts are presented as propagation of sampling failures, yet the manuscript does not quantify how much of the bias is attributable to sampling versus other generation artifacts (e.g., position bias in training data). This link is load-bearing for the practical implications.
minor comments (2)
- [Methods] The manuscript should include a table listing all 15 distributions, their parameters, and the precise pass/fail criteria applied to each.
- [Figures] Figure captions for degradation plots should explicitly state the statistical test and sample size N used at each point.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we intend to make in the updated version.
Point-by-point responses
Referee: [Dual-protocol design] Dual-protocol design: the central claim that failures demonstrate absence of an internal sampler (rather than instruction-following limits) rests on fixed prompt templates with no reported ablations on phrasing, few-shot exemplars of correct sampling, or explicit randomness instructions. If pass rates exceed 50% under varied prompts while holding temperature and N fixed, the 7% median and near-zero independent-request results would reflect prompt sensitivity instead, weakening the 'necessitating external tools' conclusion.
Authors: We agree that exploring prompt variations would strengthen the robustness of our findings. Our original prompts were intentionally minimal to isolate the sampling capability without additional scaffolding. Nevertheless, we will include ablations with alternative phrasings, few-shot examples, and explicit instructions for randomness in the revised manuscript. Initial explorations indicate that these modifications do not substantially improve performance, reinforcing our conclusion that the issue is not merely one of instruction following but a lack of an internal sampling mechanism. revision: yes
Referee: [Results] Results and statistical tests: the abstract states concrete pass rates and monotonic trends but provides no explicit description of the exact statistical tests (e.g., Kolmogorov-Smirnov, chi-squared), p-value thresholds, or multiple-comparison corrections used to declare a 'pass.' Without these details the 7% median cannot be evaluated for robustness against post-hoc threshold choices.
Authors: We acknowledge the need for greater transparency in our statistical methodology. The revised manuscript will include a new subsection detailing the statistical tests used: the Kolmogorov-Smirnov test for continuous distributions and the chi-squared goodness-of-fit test for discrete ones. We used a significance threshold of p < 0.05 with Bonferroni correction applied for the multiple comparisons across distributions and models. This information will be added to the Methods section to allow full evaluation of the results. revision: yes
Referee: [Downstream applications] Downstream tasks: the reported biases in MCQ position constraints and demographic targets in image prompts are presented as propagation of sampling failures, yet the manuscript does not quantify how much of the bias is attributable to sampling versus other generation artifacts (e.g., position bias in training data). This link is load-bearing for the practical implications.
Authors: This comment highlights an important distinction. While our experiments show that the sampling deficiencies directly contribute to the observed biases in the downstream tasks, fully disentangling the contribution from other sources such as positional biases in the training data would require additional controlled studies. In the revision, we will expand the discussion to note this limitation and provide qualitative evidence linking the sampling failures to the biases. We maintain that the propagation effect is evident from the experimental design, but we will temper the claims accordingly. revision: partial
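A simple way to quantify the MCQ answer-position bias discussed here is a chi-squared test of correct-answer positions against a uniform target. The sketch below assumes one answer letter per generated question; the function name and option set are illustrative, not the paper's pipeline.

```python
from collections import Counter
from scipy import stats

def answer_position_bias(answer_letters, options=("A", "B", "C", "D")):
    """Chi-squared test of correct-answer positions against a uniform target."""
    counts = Counter(answer_letters)
    observed = [counts.get(opt, 0) for opt in options]
    expected = [len(answer_letters) / len(options)] * len(options)
    chi2, p_uniform = stats.chisquare(observed, expected)
    return {"counts": dict(zip(options, observed)), "chi2": chi2, "p_uniform": p_uniform}

# A strong skew toward one position (e.g. "C") yields a very small p_uniform,
# i.e. the uniform answer-position constraint is violated.
```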
Circularity Check
No circularity: purely empirical benchmarking with direct statistical measurements
Full rationale
The paper conducts a large-scale empirical audit of LLM sampling behavior across 15 distributions using two protocols (batch generation and independent requests), reporting pass rates, monotonic degradation trends, and downstream bias examples. No derivations, equations, fitted parameters, predictions, or self-citations appear in the provided text; all claims rest on direct comparison of generated samples to known distribution properties via standard statistical tests. The argument is grounded in external statistical benchmarks rather than the paper's own constructs and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- [Domain assumption] Standard statistical tests (e.g., Kolmogorov-Smirnov, chi-squared) are appropriate and sufficient to evaluate whether LLM outputs match the target distribution.
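As one concrete reading of this assumption, the sketch below applies the pass/fail rule the authors describe in their response: Kolmogorov-Smirnov for continuous targets, chi-squared goodness-of-fit for discrete ones, and a Bonferroni-corrected threshold. The helper names and the size of the correction family are assumptions for illustration, not the paper's code.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05
N_TESTS = 11 * 15                  # assumed correction family: models x distributions
ALPHA_CORRECTED = ALPHA / N_TESTS  # Bonferroni-corrected significance threshold

def passes_continuous(samples, frozen_dist, alpha=ALPHA_CORRECTED):
    """Kolmogorov-Smirnov goodness-of-fit against a continuous target distribution."""
    _, p_value = stats.kstest(samples, frozen_dist.cdf)
    return p_value > alpha

def passes_discrete(samples, support, frozen_dist, alpha=ALPHA_CORRECTED):
    """Chi-squared goodness-of-fit against a discrete target on a finite support."""
    samples = np.asarray(samples)
    observed = np.array([(samples == k).sum() for k in support])
    expected = frozen_dist.pmf(support) * len(samples)
    expected *= observed.sum() / expected.sum()  # match totals so chisquare is valid
    _, p_value = stats.chisquare(observed, expected)
    return p_value > alpha

# Example: check 1000 claimed samples from a fair six-sided die.
# passes_discrete(model_samples, support=np.arange(1, 7), frozen_dist=stats.randint(1, 7))
```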
Forward citations
Cited by 2 Pith papers
- The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions
Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice ...
- Probabilistic Calibration Is a Trainable Capability in Language Models
Fine-tuning language models on synthetic distribution-sampling prompts improves their ability to generate outputs that match target probability distributions on held-out cases.