Universal Self-Consistency for Large Language Model Generation

Charles Sutton; Denny Zhou; Jie Ren; Kefan Xiao; Pengcheng Yin; Renat Aksitov; Sushant Prakash; Uri Alon; Xinyun Chen; Xuezhi Wang

arxiv: 2311.17311 · v1 · pith:V6UMKMD6new · submitted 2023-11-29 · 💻 cs.CL · cs.AI

keywords: self-consistency, generation, multiple, performance, answer, reasoning, applicable, code

Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.
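The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording is a paraphrase written for this example, `llm` is a hypothetical callable that maps a prompt string to a reply string, and falling back to the first candidate when the reply cannot be parsed is an assumption of this sketch.

```python
import re
from typing import Callable, Sequence


def build_usc_prompt(question: str, candidates: Sequence[str]) -> str:
    """Concatenate all sampled responses and ask which one is most consistent."""
    numbered = "\n\n".join(
        f"Response {i}:\n{text}" for i, text in enumerate(candidates, start=1)
    )
    # Illustrative wording only; not taken verbatim from the paper.
    return (
        f"Question:\n{question}\n\n"
        "I have generated the following responses to the question.\n\n"
        f"{numbered}\n\n"
        "Select the most consistent response based on majority consensus. "
        'Answer in the form "The most consistent response is Response X".'
    )


def usc_select(
    question: str,
    candidates: Sequence[str],
    llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> str:
    """Return the candidate that the LLM names as most consistent."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    reply = llm(build_usc_prompt(question, candidates))
    match = re.search(r"Response\s+(\d+)", reply)
    if match:
        index = int(match.group(1)) - 1  # the prompt numbers responses from 1
        if 0 <= index < len(candidates):
            return candidates[index]
    # Assumption of this sketch: unparseable reply falls back to the first sample.
    return candidates[0]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the code."""
    return "The most consistent response is Response 2"


if __name__ == "__main__":
    samples = ["Paris is the capital.", "The capital is Paris.", "Lyon."]
    print(usc_select("What is the capital of France?", samples, stub_llm))
```

Because selection happens over whole responses rather than extracted answers, the same function applies to free-form outputs such as summaries or code, which is the setting where answer-extraction voting does not apply.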

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation
cs.LG 2026-06 unverdicted novelty 7.0

10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
cs.AI 2026-06 unverdicted novelty 7.0

MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.

Agreement in Representation Space for Open-Ended Self-Consistency
cs.CL 2026-06 unverdicted novelty 7.0

EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling
cs.LG 2026-06 unverdicted novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling
cs.LG 2026-05 unverdicted novelty 7.0

ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verif...

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
cs.AI 2026-05 unverdicted novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

Regulating Branch Parallelism in LLM Serving
cs.DC 2026-05 unverdicted novelty 7.0

TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...

A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates
cs.SE 2026-07 unverdicted novelty 6.0

PatchFusion uses deterministic atomic evidence fusion on candidate patches to outperform ranking, test-filtering, and LLM-judge selectors on SWE-bench and Defects4J pools.

Boosting Self-Consistency with Ranking
cs.CL 2026-06 unverdicted novelty 6.0

RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on Q...

FUSE: Ensembling Verifiers with Zero Labeled Data
stat.ML 2026-04 unverdicted novelty 6.0

FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and...

Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework
cs.IR 2026-04 unverdicted novelty 6.0

Small open LLMs produce highly variable medical answers even at low temperature, with self-agreement at most 0.20 and 87-97% unique outputs per model across 10 runs.

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
cs.AI 2026-06 unverdicted novelty 5.0

PPV delegation using letter entropy and per-question embedding cosine beats majority voting by 1.5 pp overall on MMLU-Pro in an unsupervised setting.

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
cs.AI 2026-05 unverdicted novelty 5.0

ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement
cs.CL 2025-07 unverdicted novelty 5.0

SMCS coordinates 15 open-source LLMs via retrieval-based prior selection and exploration-exploitation posterior enhancement, outperforming GPT-4.1 by 5.36% and GPT-o3-mini by 5.28% on eight benchmarks.

Improving Language Models with Intentional Analysis
cs.CL 2025-02 unverdicted novelty 5.0

Intentional Analysis improves language model task performance by explicitly adding intent-aware analysis and reasoning, outperforming Chain-of-Thought and working synergistically with it even on frontier models.

Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
cs.LG 2026-04 unverdicted novelty 4.0

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...