pith. machine review for the scientific record.

arxiv: 2601.05414 · v3 · submitted 2026-01-08 · 💻 cs.CL · cs.AI · stat.ML

Recognition: no theorem link

Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · stat.ML
keywords large language models · random sampling · statistical distributions · probabilistic generation · LLM evaluation · sampling bias · stochastic processes

The pith

Large language models cannot reliably generate random samples from statistical distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper audits whether large language models can produce numbers drawn from specified probability distributions. Experiments on eleven models and fifteen distributions reveal that generating one thousand samples in a single response passes statistical checks at a median rate of only seven percent, while requesting each sample independently causes ten of the eleven models to fail every distribution. Performance worsens as the distributions become more complex and as the requested sample size grows. These sampling shortfalls then create measurable biases when the models are used for tasks such as creating multiple-choice questions or synthesizing image prompts with demographic targets. The results indicate that LLMs lack a working internal mechanism for probabilistic sampling and must rely on external tools for any application that requires statistical guarantees.

Core claim

Frontier large language models lack a functional internal sampler for probability distributions. This is shown by a median 7 percent pass rate under batch generation of N=1000 samples and by near-total failure (10 of 11 models passing zero distributions) under independent requests. Sampling fidelity declines monotonically with distributional complexity and with increasing N. The same deficiencies produce systematic biases in downstream applications, including non-uniform answer-position distributions in multiple-choice question generation and violations of demographic targets in constrained text-to-image prompt synthesis.

What carries the argument

Dual-protocol audit (batch generation versus independent requests) that measures statistical validity of samples drawn from 15 distributions.
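The two protocols differ only in how the N samples are obtained. A minimal sketch of the audit harness, with a hypothetical `ask_model` stub standing in for a real LLM call (here backed by NumPy's RNG with a fixed seed, so the harness runs standalone; the fixed seed makes repeated single-sample calls return the same value, a degenerate repetition failure the test correctly flags):

```python
import numpy as np
from scipy import stats

def ask_model(prompt, n):
    # Hypothetical placeholder for an LLM API call, stubbed with a
    # seeded RNG. Re-seeding on every call means each "independent
    # request" returns the identical value.
    return np.random.default_rng(0).normal(0, 1, n)

def batch_protocol(n=1000):
    # Batch Generation: one request returns all N samples at once.
    return ask_model("Generate 1000 samples from N(0,1).", n)

def independent_protocol(n=1000):
    # Independent Requests: N stateless calls, one sample each.
    return np.array([ask_model("Generate one sample from N(0,1).", 1)[0]
                     for _ in range(n)])

def passes(samples, cdf=stats.norm.cdf, alpha=0.05):
    # One-sample Kolmogorov-Smirnov test against the target CDF.
    return stats.kstest(samples, cdf).pvalue >= alpha
```

Swapping `ask_model` for a real model client and `stats.norm.cdf` for each target distribution's CDF reproduces the shape of the audit, though the paper's exact prompts and pass criteria are not given in the text above.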

Load-bearing premise

The chosen statistical tests and pass-rate thresholds accurately measure whether an LLM can serve as a reliable sampler in downstream applications.

What would settle it

Demonstration of a single model that passes every statistical test for all 15 distributions when producing 1000 independent samples per distribution.

read the original abstract

As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N{=}1000$ samples within one response, and Independent Requests, comprising $N{=}1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the sampling horizon $N$ increases. Finally, we demonstrate how the propagation of these failures into downstream real-world application tasks introduces systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript audits native probabilistic sampling in 11 frontier LLMs across 15 distributions using a dual-protocol design: batch generation of N=1000 samples in a single response versus 1000 independent stateless requests. It reports a 7% median pass rate for batch mode, near-total collapse (10 of 11 models passing zero distributions) in independent mode, monotonic degradation with distributional complexity and increasing N, and downstream biases such as violated uniform answer-position constraints in MCQ generation and demographic target violations in text-to-image prompt synthesis. The central claim is that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.

Significance. If the empirical results hold after addressing methodological gaps, the work is significant for AI systems engineering because it quantifies a practical barrier to deploying LLMs in stochastic pipelines, simulation, or fairness-critical tasks. The protocol asymmetry and downstream propagation examples supply concrete, falsifiable benchmarks that could inform when practitioners must wrap LLMs with external RNGs or samplers rather than relying on native generation.

major comments (3)
  1. [Dual-protocol design] Dual-protocol design: the central claim that failures demonstrate absence of an internal sampler (rather than instruction-following limits) rests on fixed prompt templates with no reported ablations on phrasing, few-shot exemplars of correct sampling, or explicit randomness instructions. If pass rates exceed 50% under varied prompts while holding temperature and N fixed, the 7% median and near-zero independent-request results would reflect prompt sensitivity instead, weakening the 'necessitating external tools' conclusion.
  2. [Results] Results and statistical tests: the abstract states concrete pass rates and monotonic trends but provides no explicit description of the exact statistical tests (e.g., Kolmogorov-Smirnov, chi-squared), p-value thresholds, or multiple-comparison corrections used to declare a 'pass.' Without these details the 7% median cannot be evaluated for robustness against post-hoc threshold choices.
  3. [Downstream applications] Downstream tasks: the reported biases in MCQ position constraints and demographic targets in image prompts are presented as propagation of sampling failures, yet the manuscript does not quantify how much of the bias is attributable to sampling versus other generation artifacts (e.g., position bias in training data). This link is load-bearing for the practical implications.
minor comments (2)
  1. [Methods] The manuscript should include a table listing all 15 distributions, their parameters, and the precise pass/fail criteria applied to each.
  2. [Figures] Figure captions for degradation plots should explicitly state the statistical test and sample size N used at each point.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we intend to make in the updated version.

read point-by-point responses
  1. Referee: [Dual-protocol design] Dual-protocol design: the central claim that failures demonstrate absence of an internal sampler (rather than instruction-following limits) rests on fixed prompt templates with no reported ablations on phrasing, few-shot exemplars of correct sampling, or explicit randomness instructions. If pass rates exceed 50% under varied prompts while holding temperature and N fixed, the 7% median and near-zero independent-request results would reflect prompt sensitivity instead, weakening the 'necessitating external tools' conclusion.

    Authors: We agree that exploring prompt variations would strengthen the robustness of our findings. Our original prompts were intentionally minimal to isolate the sampling capability without additional scaffolding. Nevertheless, we will include ablations with alternative phrasings, few-shot examples, and explicit instructions for randomness in the revised manuscript. Initial explorations indicate that these modifications do not substantially improve performance, reinforcing our conclusion that the issue is not merely one of instruction following but a lack of an internal sampling mechanism. revision: yes

  2. Referee: [Results] Results and statistical tests: the abstract states concrete pass rates and monotonic trends but provides no explicit description of the exact statistical tests (e.g., Kolmogorov-Smirnov, chi-squared), p-value thresholds, or multiple-comparison corrections used to declare a 'pass.' Without these details the 7% median cannot be evaluated for robustness against post-hoc threshold choices.

    Authors: We acknowledge the need for greater transparency in our statistical methodology. The revised manuscript will include a new subsection detailing the statistical tests used: the Kolmogorov-Smirnov test for continuous distributions and the chi-squared goodness-of-fit test for discrete ones. We used a significance threshold of p < 0.05 with Bonferroni correction applied for the multiple comparisons across distributions and models. This information will be added to the Methods section to allow full evaluation of the results. revision: yes
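The criterion the authors describe (KS for continuous distributions, chi-squared for discrete ones, p < 0.05 with Bonferroni correction across the model-by-distribution grid) can be written down compactly. A sketch under those stated assumptions; the exact binning and test assignments per distribution are not specified in the text above:

```python
from scipy import stats

ALPHA = 0.05

def bonferroni_alpha(n_models, n_dists, alpha=ALPHA):
    # Correct the significance level for all model x distribution
    # comparisons (11 * 15 = 165 tests in the paper's setting).
    return alpha / (n_models * n_dists)

def passes_continuous(samples, target_cdf, corrected_alpha):
    # One-sample Kolmogorov-Smirnov test against the target CDF.
    return stats.kstest(samples, target_cdf).pvalue >= corrected_alpha

def passes_discrete(observed_counts, expected_counts, corrected_alpha):
    # Chi-squared goodness-of-fit on binned counts.
    res = stats.chisquare(observed_counts, f_exp=expected_counts)
    return res.pvalue >= corrected_alpha
```

With 11 models and 15 distributions the corrected threshold is 0.05/165, roughly 3e-4, which makes a declared "pass" a fairly demanding bar.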

  3. Referee: [Downstream applications] Downstream tasks: the reported biases in MCQ position constraints and demographic targets in image prompts are presented as propagation of sampling failures, yet the manuscript does not quantify how much of the bias is attributable to sampling versus other generation artifacts (e.g., position bias in training data). This link is load-bearing for the practical implications.

    Authors: This comment highlights an important distinction. While our experiments show that the sampling deficiencies directly contribute to the observed biases in the downstream tasks, fully disentangling the contribution from other sources such as positional biases in the training data would require additional controlled studies. In the revision, we will expand the discussion to note this limitation and provide qualitative evidence linking the sampling failures to the biases. We maintain that the propagation effect is evident from the experimental design, but we will temper the claims accordingly. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct statistical measurements

full rationale

The paper conducts a large-scale empirical audit of LLM sampling behavior across 15 distributions using two protocols (batch generation and independent requests), reporting pass rates, monotonic degradation trends, and downstream bias examples. No derivations, equations, fitted parameters, predictions, or self-citations appear in the provided text; all claims rest on direct comparison of generated samples to known distribution properties via standard statistical tests. The evaluation is grounded in statistical ground truth external to the paper's own outputs, and it contains none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that standard statistical goodness-of-fit tests are sufficient to diagnose whether an LLM possesses an internal sampler, plus the domain assumption that LLMs should be expected to sample faithfully from prompted distributions without external tools.

axioms (1)
  • domain assumption: Standard statistical tests (e.g., Kolmogorov-Smirnov, chi-squared) are appropriate and sufficient to evaluate whether LLM outputs match the target distribution.
    Invoked when defining pass rates for the 15 distributions.

pith-pipeline@v0.9.0 · 5541 in / 1181 out tokens · 27363 ms · 2026-05-16T15:43:20.684181+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

    cs.CL 2026-03 unverdicted novelty 7.0

    Language models have an intrinsic randomness floor: transformers show ~0.30 entropic deviation from uniform on neutral prompts, accounting for 88-93% of observed non-randomness, while state-space models exhibit twice ...

  2. Probabilistic Calibration Is a Trainable Capability in Language Models

    cs.CL 2026-05 conditional novelty 5.0

    Fine-tuning language models on synthetic distribution-sampling prompts improves their ability to generate outputs that match target probability distributions on held-out cases.