
arxiv: 2311.17311 · v1 · pith:V6UMKMD6new · submitted 2023-11-29 · 💻 cs.CL · cs.AI

Universal Self-Consistency for Large Language Model Generation

classification 💻 cs.CL cs.AI
keywords self-consistency, generation, multiple, performance, answer, reasoning, applicable, code

Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from large language models (LLMs). However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.
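The contrast the abstract draws can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: standard self-consistency is shown as a majority vote over already-extracted answers, while the USC step concatenates the candidates and asks the model to pick the most consistent one. The prompt wording, the function names, the `llm` callable (any function mapping a prompt string to a reply string), and the fallback to the first candidate are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def majority_vote(extracted_answers: List[str]) -> str:
    """Standard self-consistency: vote over answers that were already
    extracted into a directly comparable form (e.g. a final number)."""
    return Counter(extracted_answers).most_common(1)[0][0]


def build_usc_prompt(question: str, responses: List[str]) -> str:
    """Concatenate all candidates and ask the model to select one.
    The instruction wording here is an assumption, not a quoted prompt."""
    numbered = "\n\n".join(
        f"Response {i}: {r}" for i, r in enumerate(responses, start=1)
    )
    return (
        f"Question: {question}\n\n"
        f"{numbered}\n\n"
        "Evaluate these responses. Select the most consistent response "
        "based on majority consensus. Start your answer with "
        '"The most consistent response is Response X".'
    )


def parse_selection(reply: str, n: int) -> Optional[int]:
    """Return the zero-based index named in the reply, or None if the
    reply names no response or an out-of-range one."""
    match = re.search(r"Response\s+(\d+)", reply)
    if match is None:
        return None
    index = int(match.group(1)) - 1
    return index if 0 <= index < n else None


def universal_self_consistency(
    question: str, responses: List[str], llm: Callable[[str], str]
) -> str:
    """Select one free-form candidate by asking the model itself.
    Falls back to the first candidate if the reply cannot be parsed
    (an assumption of this sketch)."""
    reply = llm(build_usc_prompt(question, responses))
    index = parse_selection(reply, len(responses))
    return responses[index if index is not None else 0]


# Usage with a stub in place of a real model call.
def fake_llm(prompt: str) -> str:
    return "The most consistent response is Response 2"


candidates = [
    "Paris is the capital.",
    "The capital of France is Paris.",
    "Lyon.",
]
print(universal_self_consistency("What is the capital of France?", candidates, fake_llm))
# prints: The capital of France is Paris.
```

Note that `majority_vote` needs exact string matches between extracted answers, which is why it does not transfer to free-form outputs such as summaries; the USC step needs no extraction, at the cost of one extra model call over the concatenated candidates.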



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

    cs.LG 2026-06 unverdicted novelty 7.0

    10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.

  2. MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

    cs.AI 2026-06 unverdicted novelty 7.0

    MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.

  3. Agreement in Representation Space for Open-Ended Self-Consistency

    cs.CL 2026-06 unverdicted novelty 7.0

    EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.

  4. ATLAS: Agentic Test-time Learning-to-Allocate Scaling

    cs.LG 2026-06 unverdicted novelty 7.0

    ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

  5. ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

  6. PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

    cs.CL 2026-05 unverdicted novelty 7.0

    PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...

  7. CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verif...

  8. Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

  9. Regulating Branch Parallelism in LLM Serving

    cs.DC 2026-05 unverdicted novelty 7.0

    TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...

  10. A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates

    cs.SE 2026-07 unverdicted novelty 6.0

    PatchFusion uses deterministic atomic evidence fusion on candidate patches to outperform ranking, test-filtering, and LLM-judge selectors on SWE-bench and Defects4J pools.

  11. Boosting Self-Consistency with Ranking

    cs.CL 2026-06 unverdicted novelty 6.0

    RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on Q...

  12. FUSE: Ensembling Verifiers with Zero Labeled Data

    stat.ML 2026-04 unverdicted novelty 6.0

    FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and...

  13. Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

    cs.IR 2026-04 unverdicted novelty 6.0

    Small open LLMs produce highly variable medical answers even at low temperature, with self-agreement at most 0.20 and 87-97% unique outputs per model across 10 runs.

  14. When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

    cs.AI 2026-06 unverdicted novelty 5.0

    PPV delegation using letter entropy and per-question embedding cosine beats majority voting by 1.5 pp overall on MMLU-Pro in an unsupervised setting.

  15. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

    cs.AI 2026-05 unverdicted novelty 5.0

    ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.

  16. A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement

    cs.CL 2025-07 unverdicted novelty 5.0

    SMCS coordinates 15 open-source LLMs via retrieval-based prior selection and exploration-exploitation posterior enhancement, outperforming GPT-4.1 by 5.36% and GPT-o3-mini by 5.28% on eight benchmarks.

  17. Improving Language Models with Intentional Analysis

    cs.CL 2025-02 unverdicted novelty 5.0

    Intentional Analysis improves language model task performance by explicitly adding intent-aware analysis and reasoning, outperforming Chain-of-Thought and working synergistically with it even on frontier models.

  18. Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

    cs.LG 2026-04 unverdicted novelty 4.0

    HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...