Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

Jingkai Huang; Will Ma; Zhengyuan Zhou

arxiv: 2602.05395 · v2 · pith:VM6WURLJnew · submitted 2026-02-05 · stat.ML · cs.AI · cs.LG

keywords: answer, stopping, accuracy, achieve, bayesian, consistent, costs, efficient

A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.
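
The general idea of sampling until one answer is sufficiently consistent can be illustrated with a minimal sketch. This is not the paper's L-aggregated policy: it is a simplified stand-in that tracks only the two most frequent answer counts and, assuming a uniform Beta(1, 1) prior on the leader's share among those two answers, stops once the posterior probability that the leader beats the runner-up reaches a threshold. The names `prob_leader_beats_runner_up`, `sample_until_confident`, `threshold`, and `max_samples` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter
from math import comb


def prob_leader_beats_runner_up(n1: int, n2: int) -> float:
    """P(theta > 1/2) for theta ~ Beta(n1 + 1, n2 + 1).

    theta is the leader's share among samples that hit one of the two
    most frequent answers (uniform prior, an assumption of this sketch).
    For integer parameters the Beta tail equals a Binomial(n, 1/2) CDF.
    """
    n = n1 + n2 + 1
    return sum(comb(n, j) for j in range(n1 + 1)) / 2 ** n


def sample_until_confident(sample_answer, threshold=0.95, max_samples=40):
    """Draw answers one at a time; stop when the leader looks stable.

    sample_answer: zero-argument callable returning one answer (for
    example, a wrapper around a single LLM call). max_samples must be >= 1.
    Returns (most frequent answer, number of samples drawn).
    """
    counts = Counter()
    for n_drawn in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        top = counts.most_common(2)
        n1 = top[0][1]                          # leader count
        n2 = top[1][1] if len(top) > 1 else 0   # runner-up count
        if prob_leader_beats_runner_up(n1, n2) >= threshold:
            break
    return counts.most_common(1)[0][0], n_drawn


if __name__ == "__main__":
    # Hypothetical stand-in for an LLM: answers "42" about 70% of the time.
    rng = random.Random(0)
    fake_llm = lambda: rng.choices(["42", "41", "7"], weights=[7, 2, 1])[0]
    print(sample_until_confident(fake_llm))
```

In this sketch a larger `threshold` trades more samples for more confidence in the returned mode; an informative prior, as the abstract describes, would replace the uniform Beta(1, 1) assumption.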

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
cs.AI · 2026-06 · unverdicted · novelty 7.0

MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.

Online Pandora's Box for Contextual LLM Cascading
cs.AI · 2026-06 · unverdicted · novelty 7.0

Introduces a parametric reservation-index policy with GMM estimation and UCB exploration for contextual LLM cascading under output-mediated feedback, claiming dimension-dependent square-root regret.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling
cs.LG · 2026-06 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.