pith. machine review for the scientific record.

arxiv: 2605.03379 · v2 · submitted 2026-05-05 · 💻 cs.LG · cs.CL

Recognition: 3 theorem links · Lean Theorem

Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords repeated LLM inference · majority voting · distribution-free bounds · moment problems · conditional i.i.d. · test-time compute · latent success probability · vote-accuracy curve

The pith

Two labeled calls suffice to produce sharp distribution-free bounds on majority-vote accuracy for any LLM sampling budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Repeated LLM sampling improves accuracy only to the extent that latent per-example correctness probabilities vary across inputs. Two calls recover the first two moments of this latent distribution and therefore the correlation between same-example calls. These moments determine exact intervals for the accuracy achieved by majority voting at any fixed budget. The intervals are distribution-free because the underlying moment problem admits three-atom extremal distributions and quadratic dual certificates. Experiments on QNLI and QQP confirm that observed three- and five-vote accuracies lie inside the intervals computed from the first two calls alone.

Core claim

From the first two moments of the latent success probability q, every fixed majority-vote budget has a sharp distribution-free interval. The infinite-dimensional moment problem is solved exactly by three-atom extremizers and quadratic dual certificates for each finite budget. The three-vote case has closed form with width at most 1/8 and a certified-improvement criterion; the infinite-vote endpoint is also bounded yet remains sensitive to latent mass near q = 1/2. Maximum-entropy and latent-difficulty Gaussian-probit completions can be added, and empirical voting accuracies on QNLI and QQP fall inside the projected two-call regions.
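The vote-accuracy curve itself is plain binomial algebra once the conditional-i.i.d. model is granted; a minimal sketch (the function name is ours, odd budgets only):

```python
from math import comb

def vote_curve(q: float, k: int) -> float:
    """Probability that the majority of k conditionally i.i.d. calls is
    correct, given latent per-example success probability q. Assumes k is
    odd so majority votes cannot tie."""
    return sum(comb(k, j) * q ** j * (1.0 - q) ** (k - j)
               for j in range(k // 2 + 1, k + 1))
```

At q = 0.6 the curve climbs through 0.6, 0.648, then roughly 0.683 for k = 1, 3, 5, while at q = 0.4 it falls; averaging this curve against the latent law of q is exactly the quantity the two-call intervals bound.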

What carries the argument

The two-moment problem for the binary correctness layer under conditional i.i.d. sampling, solved via three-atom extremal distributions and quadratic dual certificates.
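Under that conditional-i.i.d. premise, the two-moment identification is a few lines of arithmetic; a simulation sketch with a hypothetical Beta(2, 2) latent law (all distributional choices here are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
q = rng.beta(2.0, 2.0, size=n)             # hypothetical latent success probabilities
call1 = rng.random(n) < q                  # correctness of call 1
call2 = rng.random(n) < q                  # call 2, conditionally i.i.d. given q

m1 = call1.mean()                          # one labeled call identifies E[q]
m2 = (call1 & call2).mean()                # two labeled calls identify E[q^2]
rho = (m2 - m1**2) / (m1 * (1.0 - m1))     # same-example correctness correlation
```

For Beta(2, 2) the targets are E[q] = 1/2, E[q²] = 3/10, and ρ = 1/5, so the estimates should land close; on real data the same three lines run on paired labeled calls.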

If this is right

  • Every majority-vote budget receives exact two-call bounds.
  • Three-vote accuracy has closed form and a certified-improvement test.
  • Infinite-vote accuracy is sharply bounded but threshold-sensitive.
  • Maximum-entropy and Gaussian-probit point completions tighten the intervals.
  • Observed accuracies on QNLI and QQP remain inside the projected regions.
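The page does not reproduce the extremizer formulas, but the three-atom claim invites a brute-force numerical cross-check: enumerate three-atom laws on a grid that match the two moments and take the extremes of the three-vote objective E[3q² − 2q³]. A sketch under that reading (grid resolution and tolerance are our choices, and this only approximates the paper's exact bounds):

```python
import itertools
import numpy as np

def three_vote_bounds(m1, m2, gridsize=51):
    """Approximate [lower, upper] bounds on three-vote majority accuracy
    E[3q^2 - 2q^3] over laws of q on [0, 1] with first two moments (m1, m2),
    by enumerating three-atom candidates on a grid (the paper claims the
    exact extremizers have at most three atoms)."""
    grid = np.linspace(0.0, 1.0, gridsize)
    target = np.array([1.0, m1, m2])
    lo, hi = np.inf, -np.inf
    for atoms in itertools.combinations(grid, 3):
        a = np.array(atoms)
        M = np.vstack([np.ones(3), a, a * a])   # mass / mean / second moment
        w = np.linalg.solve(M, target)          # distinct atoms -> invertible
        if np.all(w >= -1e-12):                 # keep nonnegative weight vectors
            val = float(w @ (3 * a**2 - 2 * a**3))
            lo, hi = min(lo, val), max(hi, val)
    return lo, hi
```

A degenerate pair such as (m1, m2) = (0.6, 0.36) pins both ends near V₃(0.6) = 0.648, while a spread pair like (0.5, 0.375) yields a genuine interval whose width stays under the claimed 1/8 cap.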

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Two-call probes could guide allocation of test-time compute without requiring full distributional knowledge.
  • Temperature changes or model mixtures can produce voting gains not ordered by single-call accuracy.
  • The same two-moment reduction may apply to non-binary or multi-class correctness layers.
  • Practitioners could use the certified-improvement criterion to decide when extra votes are worthwhile on a given task.
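For the last point, the only boundary quoted on this page is the zero contour from Figure 3, ρ = 1 − 1/(2µ) for µ > 1/2; one plausible decision rule built on it (our reading of which side certifies a gain, names ours):

```python
def improvement_boundary(mu: float) -> float:
    """Zero contour of the certified three-vote gain as quoted in the
    page's Figure 3 caption: rho = 1 - 1/(2*mu), defined for mu > 1/2."""
    return 1.0 - 1.0 / (2.0 * mu)

def three_votes_certified(mu: float, rho: float) -> bool:
    """Our reading: two-call pairs (mu, rho) strictly below the boundary
    certify a three-vote gain over one call; mu <= 1/2 certifies nothing."""
    return mu > 0.5 and rho < improvement_boundary(mu)
```

For example, (µ, ρ) = (2/3, 1/7) sits below the boundary value 1/4 and would certify extra votes, while ρ = 1/2 at µ = 0.6 would not.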

Load-bearing premise

Repeated calls are conditionally independent and identically distributed given the latent per-example success probability q.

What would settle it

Compute the two-call moments on a dataset, derive the predicted interval for three-vote accuracy, then perform three-vote inference on the same dataset and verify whether the observed accuracy falls outside the interval.
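That protocol is cheap to rehearse end-to-end in simulation before spending real labels; a sketch with a hypothetical Beta(4, 2) latent law (every distributional choice here is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
q = rng.beta(4.0, 2.0, size=n)               # hypothetical latent distribution

# Step 1: two labeled calls per example -> moment estimates.
pair = rng.random((n, 2)) < q[:, None]
m1 = pair[:, 0].mean()                       # targets E[q]   = 2/3
m2 = (pair[:, 0] & pair[:, 1]).mean()        # targets E[q^2] = 10/21

# Step 2: fresh three-vote inference on the same examples.
votes = rng.random((n, 3)) < q[:, None]
acc3 = (votes.sum(axis=1) >= 2).mean()       # observed majority-of-3 accuracy

# Under conditional i.i.d., acc3 concentrates at E[3q^2 - 2q^3] = 5/7 here,
# and must land inside the paper's interval computed from (m1, m2).
```

The observed three-vote accuracy should settle near 5/7 ≈ 0.714, visibly above the one-call accuracy 2/3; a falsification of the paper's claim would be an observed value escaping the interval derived from (m1, m2).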

Figures

Figures reproduced from arXiv: 2605.03379 by Yi Liu.

Figure 1. Identified width after two labeled calls. Left: exact width …
Figure 2. Policies in the two-call votability plane. Colors denote model families and marker shapes denote …
Figure 3. Randomized mixture policies in the two-call votability plane. Grey points are mixture-grid policies; …
Figure 3. The certified three-vote gain ∆cert,1 = L1 − µ over the feasible two-call moment region. The zero contour is exactly the theorem's improvement boundary ρ = 1 − 1/(2µ) for µ > 1/2.
Figure 4. Randomized mixture policies in the two-call votability plane. Grey points are mixture-grid policies; …
read the original abstract

Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, width at most $1/8$, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded, but remains threshold-sensitive because it depends on latent mass around $q=1/2$. We add maximum-entropy and latent-difficulty Gaussian-probit point completions, and experiments on LLM calls over QNLI and QQP show that empirical three- and five-vote accuracies are contained in the projected two-call regions while temperature changes and randomized model mixtures can create voting gains not ordered by one-call accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that, under the conditional-i.i.d. model for repeated LLM calls given a latent per-example success probability q, two labeled calls suffice to identify the first two moments of the distribution of q. These moments determine sharp, distribution-free intervals for the accuracy of majority voting at any fixed budget via reduction to an infinite-dimensional moment problem on [0,1]. The reduction is asserted to admit three-atom extremizers together with quadratic dual certificates, yielding exact (non-relaxed) bounds. A closed-form expression is given for the three-vote case (width at most 1/8) along with a certified-improvement criterion; the infinite-vote limit is also bounded but remains sensitive to mass near q=1/2. Maximum-entropy and Gaussian-probit completions are supplied as point estimates, and experiments on QNLI and QQP are reported to show that empirical three- and five-vote accuracies lie inside the projected two-call intervals.

Significance. If the central reduction holds, the work supplies a principled, low-cost method for quantifying the value of repeated sampling in LLMs by separating stable errors from recoverable call-level noise. The distribution-free character of the intervals, their exactness via three-atom extremal measures, and the closed-form three-vote result constitute a clear technical advance over parametric or simulation-based approaches. The empirical containment on standard benchmarks and the analysis of the infinite-vote endpoint add practical relevance for test-time compute allocation.

major comments (2)
  1. [moment-problem reduction (abstract, §4)] The central technical claim (abstract and §4) is that the moment problem admits three-atom extremizers and quadratic dual certificates for every finite budget, delivering exact rather than relaxed bounds. The manuscript must exhibit the explicit dual-certificate construction (including the quadratic form and the verification that it certifies the optimum for the majority-vote objective) so that the asserted exactness can be checked; without this, the reduction remains a statement rather than a demonstrated result.
  2. [three-vote case (abstract, §5)] The three-vote closed form (abstract) is stated to have width at most 1/8 and to supply a certified-improvement criterion. The derivation of this closed form from the three-atom extremizer must be supplied in full, together with the algebraic verification that the width bound holds uniformly over all feasible first- and second-moment pairs.
minor comments (2)
  1. [experiments] The projection of the two-call moment intervals onto finite-budget accuracies (experiments section) should include an explicit algorithmic description or pseudocode for how the bounds are computed from the observed moments, to facilitate reproduction.
  2. [preliminaries] Notation for the latent variable q and the majority-vote accuracy functional should be introduced once in a dedicated preliminary section and used consistently thereafter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and the recommendation of minor revision. The comments highlight important points for strengthening the presentation of the technical results. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [moment-problem reduction (abstract, §4)] The central technical claim (abstract and §4) is that the moment problem admits three-atom extremizers and quadratic dual certificates for every finite budget, delivering exact rather than relaxed bounds. The manuscript must exhibit the explicit dual-certificate construction (including the quadratic form and the verification that it certifies the optimum for the majority-vote objective) so that the asserted exactness can be checked; without this, the reduction remains a statement rather than a demonstrated result.

    Authors: We agree that providing the explicit dual-certificate construction is essential to substantiate the claim of exact bounds. In the revised version, we will augment Section 4 with a detailed construction of the quadratic dual certificate for the majority-vote objective. This will include the specific quadratic form in terms of the moments and a verification step showing that the dual objective equals the value attained by the three-atom extremal measure for any feasible first and second moments. We believe this will fully demonstrate the exactness of the reduction. revision: yes

  2. Referee: [three-vote case (abstract, §5)] The three-vote closed form (abstract) is stated to have width at most 1/8 and to supply a certified-improvement criterion. The derivation of this closed form from the three-atom extremizer must be supplied in full, together with the algebraic verification that the width bound holds uniformly over all feasible first- and second-moment pairs.

    Authors: We will expand the presentation in Section 5 to provide the complete derivation of the closed-form bounds from the three-atom extremizer. This will detail the optimization steps leading to the explicit expressions for the lower and upper bounds on three-vote accuracy. Additionally, we will include the algebraic verification that the width of these bounds is at most 1/8 for all pairs of moments (m1, m2) satisfying the feasibility constraints (i.e., 0 ≤ m2 ≤ m1 ≤ 1 and m2 ≥ m1²). The verification proceeds by parameterizing the feasible region and showing the bound holds by direct (if tedious) computation or by analyzing the extremal cases. revision: yes
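The feasibility region the rebuttal invokes is a two-line check worth stating explicitly (function name ours):

```python
def feasible_moments(m1: float, m2: float, tol: float = 1e-12) -> bool:
    """A pair (m1, m2) is realizable as the first two moments of some
    [0, 1]-valued q iff m1^2 <= m2 <= m1 (Jensen below, boundedness above)."""
    return 0.0 <= m1 <= 1.0 and m1 * m1 - tol <= m2 <= m1 + tol
```

The lower edge m2 = m1² is the degenerate zero-variance case; the upper edge m2 = m1 corresponds to q supported on {0, 1}, i.e. fully stable errors.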

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper identifies the mean and second moment of the latent success probability q from one and two labeled calls respectively, then reduces the problem of bounding majority-vote accuracy for any fixed budget to a moment problem on [0,1]. It asserts that this infinite-dimensional problem admits three-atom extremizers together with quadratic dual certificates, yielding exact (non-relaxed) intervals. This reduction is presented as a technical fact about the moment problem under the stated conditional-i.i.d. model; it does not rename a fitted parameter as a prediction, define a quantity in terms of itself, or rely on a load-bearing self-citation whose content is unverified. The three-vote closed form and infinite-vote endpoint are derived consequences of the same moment reduction rather than inputs. The maximum-entropy and Gaussian-probit completions are explicitly labeled as separate point estimates. No step in the described chain reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the conditional-i.i.d. modeling assumption and the applicability of extremal moment theory; no free parameters are fitted to target accuracies for the main bounds.

axioms (2)
  • domain assumption Repeated LLM calls are conditionally independent and identically distributed given the latent per-example success probability q
    This is the foundational modeling assumption for the binary correctness layer stated in the abstract.
  • domain assumption Correctness per call is binary
    The analysis is restricted to the binary correctness layer.

pith-pipeline@v0.9.0 · 5543 in / 1314 out tokens · 35588 ms · 2026-05-08T18:39:26.928532+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation

    cs.LG 2026-05 unverdicted novelty 7.0

    The voting curve from repeated binary predictions is exactly equivalent to a signed voting signature capturing excess latent mass above the majority threshold at binomial variance scales, via signed Hausdorff moments.

Reference graph

Works this paper leans on

30 extracted references · 18 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs

    P. Aggarwal, A. Madaan, Y. Yang, and Mausam. Let's sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12375–12396, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.761. URL https:...

  2. [2]

    J. H. Albert. Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17(3):251–269, 1992. doi: 10.3102/10769986017003251. URL https://doi.org/10.3102/10769986017003251

  3. [3]

    B. Atıl, S. Aykent, A. Chittams, L. Fu, R. J. Passonneau, E. Radcliffe, G. R. Rajagopal, A. Sloan, T. Tudrej, F. Ture, Z. Wu, L. Xu, and B. Baldwin. Non-determinism of “deterministic” LLM system settings in hosted environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, pages 135–148, Mumbai, India, 2025. Associatio...

  4. [4]

    D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780–804, 2005. doi: 10.1137/S1052623401399903. URL https://doi.org/10.1137/S1052623401399903

  5. [5]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    B. Brown, J. Juravsky, R. S. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. doi: 10.48550/arXiv.2407.21787. URL https://arxiv.org/abs/2407.21787

  6. [6]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

  7. [7]

    S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. doi: 10.1038/s41586-024-07421-0. URL https://doi.org/10.1038/s41586-024-07421-0

  8. [8]

    R. J. Gallo, M. Baiocchi, T. R. Savage, and J. H. Chen. Establishing best practices in large language model research: An application to repeat prompting. Journal of the American Medical Informatics Association, 32(2):386–390, 2025. doi: 10.1093/jamia/ocae294. URL https://doi.org/10.1093/jamia/ocae294

  9. [9]

    E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957. doi: 10.1103/PhysRev.106.620. URL https://doi.org/10.1103/PhysRev.106.620

  10. [10]

  11. [11]

    S. Karlin and W. J. Studden. Tchebycheff Systems: With Applications in Analysis and Statistics. Interscience Publishers, New York, 1966

  12. [12]

    L. Kuhn, Y. Gal, and S. Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve

  13. [13]

    F. M. Lord. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates, Hillsdale, NJ, 1980.

  14. [14]

    P. Manakul, A. Liusie, and M. J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.557. URL https...

  15. [15]

    R. Nowak. Estimating the self-consistency of LLMs. arXiv preprint arXiv:2509.19489, 2025. doi: 10.48550/arXiv.2509.19489. URL https://arxiv.org/abs/2509.19489

  16. [16]

    Generate a completion

    Ollama. Generate a completion. Documentation, 2026. URL https://docs.ollama.com/api/generate. Accessed 2026-05-03

  17. [17]

    llama3.1:8b model page

    Ollama. llama3.1:8b model page. Model documentation, 2026. URL https://ollama.com/library/llama3.1:8b. Accessed 2026-05-03

  18. [18]

    phi4-mini model page

    Ollama. phi4-mini model page. Model documentation, 2026. URL https://ollama.com/library/phi4-mini. Accessed 2026-05-03

  19. [19]

    qwen2.5:7b model page

    Ollama. qwen2.5:7b model page. Model documentation, 2026. URL https://ollama.com/library/qwen2.5:7b. Accessed 2026-05-03

  20. [20]

    I. Pinelis. On the extreme points of moments sets. Mathematical Methods of Operations Research, 83(3):325–349, 2016. doi: 10.1007/s00186-015-0530-0. URL https://doi.org/10.1007/s00186-015-0530-0

  21. [21]

    P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264/

  22. [22]

    T. Savage, J. Wang, R. Gallo, A. Boukil, V. Patel, S. A. A. Safavi-Naini, A. Soroush, and J. H. Chen. Large language model uncertainty proxies: Discrimination and calibration for medical diagnosis and treatment. Journal of the American Medical Informatics Association, 32(1):139–149, 2025. doi: 10.1093/jamia/ocae254. URL https://doi.org/10.1093/jamia/ocae254

  23. [23]

    A. Taubenfeld, T. Sheffer, E. Ofek, A. Feder, A. Goldstein, Z. Gekhman, and G. Yona. Confidence improves self-consistency in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20090–20111, Vienna, Austria, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.1030. URL https://aclanthology...

  24. [24]

    R. Vashurin, E. Fadeeva, A. Vazhentsev, L. Rvanova, D. Vasilev, A. Tsvigun, S. Petrakov, R. Xing, A. Sadallah, K. Grishchenkov, A. Panchenko, T. Baldwin, P. Nakov, M. Panov, and A. Shelmanov. Benchmarking uncertainty quantification methods for large language models with LM-polygraph. Transactions of the Association for Computational Linguistics, 13:220–248,

  25. [25]
  26. [26]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJ4km2R5t7

  27. [27]

    X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw

  28. [28]

    Z. Wang, J. Duan, L. Cheng, Y. Zhang, Q. Wang, X. Shi, K. Xu, H. T. Shen, and X. Zhu. ConU: Conformal uncertainty in large language models with correctness coverage guarantees. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6886–6898, Miami, Florida, USA, 2024. Association for Computational Linguistics. doi: 10.18653/v1/2...

  29. [29]

    G. Winkler. Extreme points of moment sets. Mathematics of Operations Research, 13(4):581–587, 1988. doi: 10.1287/moor.13.4.581. URL https://doi.org/10.1287/moor.13.4.581

  30. [30]

    Q. Xiao, D. Bhattacharjya, B. Ganesan, R. Marinescu, K. Mirylenka, N. H. Pham, M. Glass, and J. Lee. The consistency hypothesis in uncertainty quantification for large language models. In Proceedings of the Forty-First Conference on Uncertainty in Artificial Intelligence, volume 286 of Proceedings of Machine Learning Research, pages 4636–4651. PMLR, 2025. U...