
arxiv: 2605.15513 · v1 · pith:QZEGLJK3new · submitted 2026-05-15 · 💻 cs.AI

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Pith reviewed 2026-05-19 15:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords parallel reasoning · pairwise verification · adaptive cascade · test-time scaling · LLM reasoning · verifier efficiency · code generation · math reasoning

The pith

CAPS uses a four-stage cascade to adapt evidence and pair selection so pairwise verification costs far less while selecting better answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Parallel reasoning generates many candidate solutions, then relies on a verifier to pick the strongest one. Standard pairwise verification always feeds complete solutions to the judge and performs many comparisons even when they add little information. CAPS instead runs a staged process that begins with partial evidence and selective pairs, escalating to full comparisons only where needed. Across four models and five code and math benchmarks, it beats the leading pairwise baseline on 14 of 20 suites while using roughly a quarter of that baseline's verifier tokens on code, and it beats pointwise verification on all 20. A reader cares because test-time scaling is one of the main ways to improve LLM reasoning, and cheaper verification makes larger candidate pools practical.

Core claim

CAPS is an inference-only framework that allocates verifier compute non-uniformly along two axes: an evidence axis, which adapts how much of each candidate the judge sees, and a distribution axis, which adapts how comparisons are spread across the pool. It instantiates these into a four-stage cascade with an optional rescue subroutine and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models and five reasoning benchmarks spanning code and math, CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20.

What carries the argument

Four-stage cascade with adaptive evidence depth and comparison distribution plus optional rescue subroutine

If this is right

  • Per-candidate marginal verifier cost drops by roughly half compared with uniform full-evidence schedules.
  • Performance exceeds the leading pairwise verifier on 14 of 20 model-benchmark combinations.
  • Performance exceeds pointwise self-verification on all 20 combinations tested.
  • Suitability can be checked in advance by measuring how much the verifier's accuracy changes between partial and full evidence.
  • The optional rescue subroutine recovers from early low-evidence mistakes on some problems.
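The pre-deployment check in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: the `Judgment` record and the `evidence_gap` function are hypothetical names, and the sketch assumes a labeled set of pairwise comparisons has already been judged once on truncated solutions and once on complete ones. The page gives no threshold for an acceptable gap, so none is applied here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One labeled pairwise comparison (hypothetical record layout)."""
    correct_at_partial: bool  # verifier picked the better candidate from truncated text
    correct_at_full: bool     # verifier picked the better candidate from complete text


def evidence_gap(judgments: list[Judgment]) -> tuple[float, float, float]:
    """Return (partial accuracy, full accuracy, full minus partial)."""
    if not judgments:
        raise ValueError("need at least one labeled judgment")
    n = len(judgments)
    acc_partial = sum(j.correct_at_partial for j in judgments) / n
    acc_full = sum(j.correct_at_full for j in judgments) / n
    return acc_partial, acc_full, acc_full - acc_partial


# Placeholder labels for illustration only, not results from the paper.
sample = [
    Judgment(True, True),
    Judgment(False, True),
    Judgment(True, True),
    Judgment(False, False),
]
print(evidence_gap(sample))  # (0.5, 0.75, 0.25)
```

A large positive gap would suggest that early partial-evidence stages risk discarding good candidates for that verifier; a small gap would suggest the cascade is a reasonable fit.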

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Uniform full-evidence pairwise verification wastes tokens on many uninformative comparisons that staged decisions can avoid.
  • The same evidence-and-distribution adaptation could be applied to other aggregation primitives such as majority voting or tree search.
  • Saved tokens could be reinvested in sampling larger candidate pools, which would be expected to raise final accuracy further.
  • The partial-versus-full accuracy diagnostic offers a practical way to decide whether any given verifier is suitable for cascaded use.

Load-bearing premise

The verifier stays accurate enough on partial evidence that early-stage decisions do not permanently eliminate the best candidate.

What would settle it

A set of problems where the verifier's accuracy on partial solutions is low enough that the correct answer is eliminated in stage one or two and the rescue subroutine fails to restore it.

Figures

Figures reproduced from arXiv: 2605.15513 by Fangzhou Lin, Haichong Zhang, Kazunori Yamada, Peiran Li, Qianwen Ge, Shuo Xing, Siyuan Yang, Zhengzhong Tu, Ziming Zhang.

Figure 1: CAPS Overview. Deduplicate, eliminate at partial evidence, eliminate at full evidence, and round-robin among the finalists, with an optional rescue subroutine for cheap-evidence errors.
… bias, positively rating its own samples even when they are incorrect [28]; and ratings produced for different candidates lack a globally comparable scale, because each judgment is made without reference to any other candida…
Figure 2: On the left: Pass@1 (%), selection accuracy across five benchmarks with pointwise, …
read the original abstract

Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.
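To make the cost claim concrete, the sketch below counts judge calls for the deterministic pipeline (no rescue) as it is described in the appendix fragments reproduced in the reference list further down this page: one halving round at partial evidence (E1), halving at full evidence (E2) until `finalists` remain, then a round-robin among the finalists. The function names and the parameters `t1` and `t2` (per-call token cost at E1 and E2) are illustrative assumptions, as is the rule that caps a Stage B round so the pool never drops below `finalists`. Only the worked case of 16 deduplicated candidates and 4 finalists is confirmed by the source.

```python
from math import comb


def caps_judge_calls(n_dedup: int, finalists: int = 4) -> tuple[int, int]:
    """Count judge calls at partial (E1) and full (E2) evidence, without rescue."""
    if n_dedup < 2 or finalists < 2:
        raise ValueError("need at least two candidates and two finalists")

    # Stage A: one halving round at partial evidence E1.
    e1_calls = n_dedup // 2
    pool = n_dedup - e1_calls  # ceil(n_dedup / 2) survivors

    # Stage B: halving rounds at full evidence E2 until `finalists` remain.
    # Capping a round so the pool never drops below `finalists` is an assumption.
    e2_calls = 0
    while pool > finalists:
        calls = min(pool // 2, pool - finalists)
        e2_calls += calls
        pool -= calls

    # Stage C: round-robin among the remaining finalists at E2.
    e2_calls += comb(pool, 2)
    return e1_calls, e2_calls


def caps_token_cost(n_dedup: int, finalists: int, t1: int, t2: int) -> int:
    """Total verifier tokens, given per-call costs t1 (E1) and t2 (E2)."""
    e1_calls, e2_calls = caps_judge_calls(n_dedup, finalists)
    return e1_calls * t1 + e2_calls * t2


print(caps_judge_calls(16, 4))  # (8, 10)
```

For 16 candidates and 4 finalists this gives 8 calls at E1 and 10 at E2, matching the 18 judge calls stated in the appendix fragment.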

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CAPS, a cascaded adaptive pairwise selection framework for efficient parallel reasoning in LLMs. It adapts verifier compute along an evidence axis (partial vs. full solution text) and a distribution axis across a pool of candidates via a four-stage cascade plus optional rescue subroutine. The work derives a closed-form verifier-token cost formula and reports that, across four self-verifying models and five reasoning benchmarks (code and math), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code tasks and outperforms pointwise self-verification on all 20.

Significance. If the central claims hold, the work offers a practical advance in test-time scaling by reducing aggregation cost in parallel reasoning without performance loss. The closed-form cost derivation and the interpretable diagnostic based on partial-versus-full evidence accuracy are strengths that enable predictable budgeting and pre-deployment checks, potentially influencing efficient inference methods for reasoning models.

major comments (3)
  1. [§3.2] §3.2 (Cascade Design): The four-stage cascade with optional rescue relies on the assumption that the self-verifier maintains sufficient accuracy on truncated (partial) evidence to avoid irrecoverable false negatives during early pruning; the manuscript provides no per-stage accuracy curves or ablation isolating the effect of partial-evidence errors on final selection quality, which is load-bearing for the reported token reduction and 14/20 win rate.
  2. [§4.1] §4.1 (Experimental Results): The outperformance claims on 14 of 20 suites and the 25.4% token-budget figure are presented without error bars, multiple-run statistics, or an ablation on cascade depth, making it difficult to assess robustness of the efficiency gains relative to uniform full-evidence pairwise verification.
  3. [§3.3] §3.3 (Cost Formula): The closed-form verifier-token cost derivation claims the per-candidate marginal cost is roughly halved, but the paper does not include a direct table or verification comparing the formula's predictions against the empirically measured token counts from the reported experiments.
minor comments (2)
  1. [Abstract] Abstract: Clarify the precise criterion for 'outperforms' (accuracy, cost, or joint) in the 14/20 claim and list all five benchmarks explicitly rather than summarizing.
  2. [Figure 2] Figure 2 or 3 (Diagnostic Plots): The trade-off suites would be strengthened by overlaying partial-versus-full evidence accuracy to directly illustrate the cascade suitability check mentioned in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your detailed and constructive review of our manuscript. We appreciate the referee's recognition of the potential practical advance in test-time scaling and the strengths of the closed-form cost derivation. We address each major comment below and will incorporate the suggested additions and analyses in the revised version.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Cascade Design): The four-stage cascade with optional rescue relies on the assumption that the self-verifier maintains sufficient accuracy on truncated (partial) evidence to avoid irrecoverable false negatives during early pruning; the manuscript provides no per-stage accuracy curves or ablation isolating the effect of partial-evidence errors on final selection quality, which is load-bearing for the reported token reduction and 14/20 win rate.

    Authors: We agree that explicit per-stage accuracy curves and a targeted ablation would strengthen the justification for the cascade design. The manuscript already provides an interpretable diagnostic based on partial-versus-full evidence accuracy as a pre-deployment check, but we will add per-stage verifier accuracy plots across models and benchmarks in the revision, together with an ablation that measures the impact of early-stage partial-evidence errors on final selection quality and overall token savings. revision: yes

  2. Referee: [§4.1] §4.1 (Experimental Results): The outperformance claims on 14 of 20 suites and the 25.4% token-budget figure are presented without error bars, multiple-run statistics, or an ablation on cascade depth, making it difficult to assess robustness of the efficiency gains relative to uniform full-evidence pairwise verification.

    Authors: We acknowledge that reporting variability and cascade-depth ablations would improve assessment of robustness. Due to compute limits the original experiments used single runs; in the revision we will rerun key suites with multiple random seeds and report means with standard deviations. We will also add an ablation varying cascade depth (2-, 3-, and 4-stage configurations) to quantify the contribution of each stage to the observed efficiency and accuracy gains. revision: yes

  3. Referee: [§3.3] §3.3 (Cost Formula): The closed-form verifier-token cost derivation claims the per-candidate marginal cost is roughly halved, but the paper does not include a direct table or verification comparing the formula's predictions against the empirically measured token counts from the reported experiments.

    Authors: We will add a dedicated verification table in the revised manuscript that directly compares the closed-form cost predictions against the empirically measured verifier token counts for each model-benchmark pair. This table will test the claimed halving of per-candidate marginal cost relative to uniform full-evidence schedules and make the cost model fully transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation self-contained from cascade structure

full rationale

The paper derives its closed-form verifier-token cost directly from the explicit four-stage cascade structure and optional rescue subroutine, without fitting parameters to data or renaming predictions. No self-definitional loops, fitted inputs called predictions, or load-bearing self-citations appear in the provided derivation chain. The method is framed as an empirical scheduling rule whose marginal-cost halving follows from the non-uniform evidence allocation by construction of the stages, and central performance claims rest on external benchmark results rather than tautological reductions. The verifier accuracy assumption at partial evidence is stated as a pre-deployment check, not smuggled into the equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that partial-evidence judgments remain sufficiently accurate and that the cascade thresholds can be chosen without extensive per-model retuning. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Verifier accuracy at partial evidence is high enough for early cascade stages to be reliable
    Invoked when describing the evidence axis and the decision to use short prefixes in early stages.

pith-pipeline@v0.9.0 · 5849 in / 1336 out tokens · 37336 ms · 2026-05-19T15:35:05.163741+00:00 · methodology



Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 13 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281, 2025

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  5. [5]

    Universal Self-Consistency for Large Language Model Generation

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311, 2023

  6. [6]

    Inference-aware fine-tuning for best-of-n sampling in large language models

    Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust. Inference-aware fine-tuning for best-of-n sampling in large language models.arXiv preprint arXiv:2412.15287, 2024

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  8. [8]

    On the ranking of a swiss system chess team tournament.Annals of Operations Research, 254(1):17–36, 2017

    László Csató. On the ranking of a swiss system chess team tournament.Annals of Operations Research, 254(1):17–36, 2017

  9. [9]

    A language anchor-guided method for robust noisy domain generalization.arXiv preprint arXiv:2503.17211, 2025

    Zilin Dai, Lehong Wang, Fangzhou Lin, Yidong Wang, Zhigang Li, Kazunori D Yamada, Ziming Zhang, and Wang Lu. A language anchor-guided method for robust noisy domain generalization.arXiv preprint arXiv:2503.17211, 2025

  10. [10]

    Round robin classification.Journal of Machine Learning Research, 2(Mar):721–747, 2002

    Johannes Fürnkranz. Round robin classification.Journal of Machine Learning Research, 2(Mar):721–747, 2002

  11. [11]

    Large language models can self-improve

    Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 1051–1068, 2023

  12. [12]

    Mm algorithms for generalized bradley-terry models.The annals of statistics, 32(1):384–406, 2004

    David R Hunter. Mm algorithms for generalized bradley-terry models.The annals of statistics, 32(1):384–406, 2004

  13. [13]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  14. [14]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  15. [15]

    KANMixer: a minimal KAN-centered mixer for long-term time series forecasting

    Lingyu Jiang, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, et al. Kanmixer: Can kan serve as a new modeling core for long-term time series forecasting?arXiv preprint arXiv:2508.01575, 2025

  16. [16]

    TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series Forecasting

    Lingyu Jiang, Lingyu Xu, Peiran Li, Qianwen Ge, Dingyi Zhuang, Shuo Xing, Wenjing Chen, Xiangbo Gao, Ting-Hsuan Chen, Xueying Zhan, et al. Timepre: Bridging accuracy, efficiency, and stability in probabilistic time-series forecasting. arXiv preprint arXiv:2511.18539, 2025

  17. [17]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

  18. [18]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  19. [19]

    Let the abyss stare back adaptive falsification for autonomous scientific discovery.arXiv preprint arXiv:2603.29045, 2026

    Peiran Li, Fangzhou Lin, Shuo Xing, Jiashuo Sun, Dylan Zhang, Siyuan Yang, Chaoqun Ni, and Zhengzhong Tu. Let the abyss stare back adaptive falsification for autonomous scientific discovery.arXiv preprint arXiv:2603.29045, 2026

  20. [20]

    Bibagent: An agentic framework for traceable miscitation detection in scientific literature.arXiv preprint arXiv:2601.16993, 2026

    Peiran Li, Fangzhou Lin, Shuo Xing, Xiang Zheng, Xi Hong, Siyuan Yang, Jiashuo Sun, Zhengzhong Tu, and Chaoqun Ni. Bibagent: An agentic framework for traceable miscitation detection in scientific literature.arXiv preprint arXiv:2601.16993, 2026

  21. [21]

    Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents.arXiv preprint arXiv:2603.05517, 2026

    Peiran Li, Jiashuo Sun, Fangzhou Lin, Shuo Xing, Tianfu Fu, Suofei Feng, Chaoqun Ni, and Zhengzhong Tu. Traversal-as-policy: Log-distilled gated behavior trees as externalized, verifiable policies for safe, robust, and efficient agents.arXiv preprint arXiv:2603.05517, 2026

  22. [22]

    Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

  23. [23]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

  24. [24]

    Position: Human-centric ai requires a minimum viable level of human understanding.arXiv preprint arXiv:2602.00854, 2026

    Fangzhou Lin, Qianwen Ge, Lingyu Xu, Peiran Li, Xiangbo Gao, Shuo Xing, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Position: Human-centric ai requires a minimum viable level of human understanding.arXiv preprint arXiv:2602.00854, 2026

  25. [25]

    AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference

    Fangzhou Lin, Peiran Li, Shuo Xing, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, and Zhengzhong Tu. Adaptfuse: Training-free sequential preference learning via externalized bayesian inference.arXiv preprint arXiv:2604.03925, 2026

  26. [26]

    Pairwise rm: Perform best-of-n sampling with knockout tournament.arXiv e-prints, pages arXiv–2501, 2025

    Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Pairwise rm: Perform best-of-n sampling with knockout tournament.arXiv e-prints, pages arXiv–2501, 2025

  27. [27]

    Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

  28. [28]

    Rethinking thinking tokens: Llms as improvement operators.arXiv preprint arXiv:2510.01123, 2025

    Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal. Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123, 2025

  29. [29]

    Learning adaptive parallel reasoning with language models.arXiv preprint arXiv:2504.15466, 2025

    Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models.arXiv preprint arXiv:2504.15466, 2025

  30. [30]

    Plum: Prompt learning using metaheuristics

    Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, KaShun Shum, Jipeng Zhang, Renjie Pi, and Tong Zhang. Plum: Prompt learning using metaheuristics. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 2177–2197, Bangkok, Thailand, August 2024. Association for Computationa...

  31. [31]

    Llm evaluators recognize and favor their own generations.Advances in Neural Information Processing Systems, 37:68772–68802, 2024

    Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations.Advances in Neural Information Processing Systems, 37:68772–68802, 2024

  32. [32]

    Recursive introspection: Teaching language model agents how to self-improve.Advances in Neural Information Processing Systems, 37:55249–55285, 2024

    Yuxiao Qu, Tianjun Zhang, Naman Garg, and Aviral Kumar. Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37:55249–55285, 2024

  33. [33]

    Scaling test-time compute without verification or rl is suboptimal. arXiv preprint arXiv:2502.12118, 2025

    Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or rl is suboptimal. arXiv preprint arXiv:2502.12118, 2025

  34. [34]

    Efficient fair queuing using deficit round-robin

    Madhavapeddi Shreedhar and George Varghese. Efficient fair queuing using deficit round-robin. IEEE/ACM Transactions on networking, 4(3):375–385, 1996

  35. [35]

    v_1: Unifying generation and self-verification for parallel reasoners. arXiv preprint arXiv:2603.04304, 2026

    Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, et al. v_1: Unifying generation and self-verification for parallel reasoners. arXiv preprint arXiv:2603.04304, 2026

  36. [36]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  37. [37]

    Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  38. [38]

    On the self-verification limitations of large language models on reasoning and planning tasks.arXiv preprint arXiv:2402.08115, 2024

    Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks.arXiv preprint arXiv:2402.08115, 2024

  39. [39]

    The efficacy of tournament designs.Computers & Operations Research, 144:105821, 2022

    Balázs R Sziklai, Péter Biró, and László Csató. The efficacy of tournament designs.Computers & Operations Research, 144:105821, 2022

  40. [40]

    Confidence improves self-consistency in llms

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 20090–20111, 2025

  41. [41]

    On the use of random set theory to bracket the results of monte carlo simulations

    Fulvio Tonon. On the use of random set theory to bracket the results of monte carlo simulations. Reliable Computing, 10(2):107–137, 2004

  42. [42]

    Recursive self-aggregation unlocks deep thinking in large language models. arXiv preprint arXiv:2509.26626, 2025

    Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, et al. Recursive self-aggregation unlocks deep thinking in large language models. arXiv preprint arXiv:2509.26626, 2025

  43. [43]

    Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning

    Xinglin Wang, Shaoxiong Feng, Yiwei Li, Peiwen Yuan, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, and Kan Li. Make every penny count: Difficulty-adaptive self-consistency for cost-efficient reasoning. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 6904–6917, 2025

  44. [44]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  45. [45]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

  46. [46]

    A survey on parallel reasoning.arXiv preprint arXiv:2510.12164, 2025

    Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, et al. A survey on parallel reasoning.arXiv preprint arXiv:2510.12164, 2025

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  48. [48]

    Large language models are better reasoners with self-verification

    Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2550–2575, 2023

  49. [49]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  50. [50]

    Understanding hyperbolic metric learning through hard negative sampling

    Yun Yue, Fangzhou Lin, Guanyi Mou, and Ziming Zhang. Understanding hyperbolic metric learning through hard negative sampling. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1891–1903, 2024

  51. [51]

    Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, and Xipeng Qiu. Revisiting the test-time scaling of o1-like models: Do they truly possess test-time scaling capabilities? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4651–4665, 2025

  52. [52]

    Chain of preference optimization: Improving chain-of-thought reasoning in llms.Advances in Neural Information Processing Systems, 37:333–356, 2024

    Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. Chain of preference optimization: Improving chain-of-thought reasoning in llms.Advances in Neural Information Processing Systems, 37:333–356, 2024

  53. [53]

    Chain of agents: Large language models collaborating on long-context tasks.Advances in Neural Information Processing Systems, 37:132208–132237, 2024

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of agents: Large language models collaborating on long-context tasks.Advances in Neural Information Processing Systems, 37:132208–132237, 2024

  54. [54]

    Gps: A probabilistic distributional similarity with gumbel priors for set-to-set matching

    Ziming Zhang, Fangzhou Lin, Haotian Liu, Jose Morales, Haichong Zhang, Kazunori Yamada, Vijaya B Kolachalama, and Venkatesh Saligrama. Gps: A probabilistic distributional similarity with gumbel priors for set-to-set matching. InThe Thirteenth International Conference on Learning Representations, 2025

  55. [55]

    Deep loss convexification for learning iterative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1501–1513, 2024

    Ziming Zhang, Yuping Shao, Yiqing Zhang, Fangzhou Lin, Haichong Zhang, and Elke Rundensteiner. Deep loss convexification for learning iterative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):1501–1513, 2024

  56. [56]

    The majority is not always right: Rl training for solution aggregation.arXiv preprint arXiv:2509.06870, 2025

    Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, and Ilia Kulikov. The majority is not always right: Rl training for solution aggregation.arXiv preprint arXiv:2509.06870, 2025

  57. [57]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  58. [58]

    A setwise approach for effective and highly efficient zero-shot ranking with large language models

    Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. A setwise approach for effective and highly efficient zero-shot ranking with large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 38–47, 2024.

    A Extended Related Work

    This appendix expands on the r...

  59. [59]

    DEDUP($C$, $\phi_0$) produces $C'$ with $N' \le 16$ and cluster sizes $\nu$

  60. [60]

    ELIMINATE($C'$, $J$, 1, $\nu$, $T$) executes one halving round at E1: invokes SLAUGHTER with key $\nu$, performs $\lfloor N'/2 \rfloor$ judge calls, returns $C_A$ with $|C_A| = \lceil N'/2 \rceil$

  61. [61]

    ELIMINATE($C_A$, $J$, 2, $S$, $T$, stop_at=4) halves at E2 until $f = 4$ remain: for $|C_A| = 8$, this is one round of 4 judge calls

  62. [62]

    (Optional) RESCUE($C_B$, $C' \setminus C_B$, 0.15) may expand $C_B$ by one candidate

  63. [63]

    Stage C round-robin: $\binom{|C_B|}{2}$ judge calls at E2, return $c^\star = \arg\max_c s_C(c)$ via Eq. (4). Total judge calls in the deterministic pipeline (no rescue): 8 at E1 (Stage A) + 4 at E2 (Stage B) + 6 at E2 (Stage C) = 18 judge calls, decomposed as $8T_1 + 10T_2$ in token cost.

    B.2 Implementation Details

    B.2.1 Per-Call Token Cost Decomposition

    The per-call token cost of a pai...

  64. [64]

    Asymptotic: $T_{\mathrm{CAPS}}(N', f) = \frac{N'}{2}(T_1 + T_2) - f T_2 + \binom{f}{2} T_2 + O(\log N')$, with marginal cost per candidate $\frac{1}{2}(T_1 + T_2)$

  65. [65]

    With CAPS-R: $\mathbb{E}[T_{\mathrm{CAPS+R}}] = T_{\mathrm{CAPS}} + p_R f T_2$, with $p_R \in [0.10, 0.15]$ adding less than 6% overhead. None of these depends on tunable hyperparameters of CAPS that affect cost: the cost is determined entirely by $N$, $f$, $\rho$, and the empirical trigger rate $p_R$, all of which are properties of the deployment.

    D Experimental Details and Extended Results

    This appendix suppl...
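As a consistency check on the two cost expressions above, the worked case from the pipeline walk-through ($N' = 16$, $f = 4$) can be substituted into the asymptotic form. This is an editorial sketch rather than a derivation from the paper: it drops the $O(\log N')$ term and assumes that $T_1$ and $T_2$ denote per-call token costs at partial (E1) and full (E2) evidence, which the excerpt implies but does not define here.

```latex
\begin{align*}
T_{\mathrm{CAPS}}(16, 4)
  &\approx \tfrac{16}{2}(T_1 + T_2) - 4T_2 + \tbinom{4}{2} T_2 \\
  &= 8T_1 + 8T_2 - 4T_2 + 6T_2 = 8T_1 + 10T_2, \\[4pt]
\frac{\mathbb{E}[T_{\mathrm{CAPS+R}}] - T_{\mathrm{CAPS}}}{T_{\mathrm{CAPS}}}
  &= \frac{p_R \, f \, T_2}{8T_1 + 10T_2}
   \le \frac{0.15 \cdot 4 \cdot T_2}{10T_2} = 0.06 .
\end{align*}
```

The first result reproduces the $8T_1 + 10T_2$ decomposition given for the deterministic pipeline, and the bound is strict whenever $T_1 > 0$, which is consistent with the stated overhead of less than 6%.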