CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
Pith reviewed 2026-05-19 15:35 UTC · model grok-4.3
The pith
CAPS uses a four-stage cascade to adapt evidence and pair selection so pairwise verification costs far less while selecting better answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAPS is an inference-only framework that allocates verifier compute non-uniformly along two axes: an evidence axis, which adapts how much of each candidate the judge sees, and a distribution axis, which adapts how comparisons are spread across the pool. It instantiates these into a four-stage cascade with an optional rescue subroutine and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models and five reasoning benchmarks spanning code and math, CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20.
What carries the argument
Four-stage cascade with adaptive evidence depth and comparison distribution plus optional rescue subroutine
If this is right
- Per-candidate marginal verifier cost drops by roughly half compared with uniform full-evidence schedules.
- Performance exceeds the leading pairwise verifier on 14 of 20 model-benchmark combinations.
- Performance exceeds pointwise self-verification on all 20 combinations tested.
- Suitability can be checked in advance by measuring how much the verifier's accuracy changes between partial and full evidence.
- The optional rescue subroutine recovers from early low-evidence mistakes on some problems.
Where Pith is reading between the lines
- Uniform full-evidence pairwise verification wastes tokens on many uninformative comparisons that staged decisions can avoid.
- The same evidence-and-distribution adaptation could be applied to other aggregation primitives such as majority voting or tree search.
- Saved tokens could be reinvested in sampling larger candidate pools, which would be expected to raise final accuracy further.
- The partial-versus-full accuracy diagnostic offers a practical way to decide whether any given verifier is suitable for cascaded use.
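The partial-versus-full check mentioned in the lists above could be run before deployment roughly as follows. This is a minimal sketch, not the paper's procedure: the `PairJudgment` record, the `evidence_gap` function, and the idea of scoring the judge on a small labeled set of candidate pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PairJudgment:
    """One labeled comparison (hypothetical record, not from the paper)."""
    partial_correct: bool  # judge chose the better candidate from truncated evidence
    full_correct: bool     # judge chose the better candidate from full evidence


def evidence_gap(judgments: Iterable[PairJudgment]) -> Tuple[float, float, float]:
    """Return (partial accuracy, full accuracy, full minus partial)."""
    items = list(judgments)
    if not items:
        raise ValueError("need at least one labeled judgment")
    acc_partial = sum(j.partial_correct for j in items) / len(items)
    acc_full = sum(j.full_correct for j in items) / len(items)
    return acc_partial, acc_full, acc_full - acc_partial


# Toy data: the judge is right on 3 of 4 pairs with partial evidence, 4 of 4 with full.
sample = [
    PairJudgment(True, True),
    PairJudgment(True, True),
    PairJudgment(False, True),
    PairJudgment(True, True),
]
acc_partial, acc_full, gap = evidence_gap(sample)
```

A large gap would suggest that low-evidence stages risk eliminating the best candidate early. What counts as "large" is not stated on this page, so any threshold would have to be calibrated per verifier.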
Load-bearing premise
The verifier stays accurate enough on partial evidence that early-stage decisions do not permanently eliminate the best candidate.
What would settle it
A set of problems where the verifier's accuracy on partial solutions is low enough that the correct answer is eliminated in stage one or two and the rescue subroutine fails to restore it.
Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CAPS, a cascaded adaptive pairwise selection framework for efficient parallel reasoning in LLMs. It adapts verifier compute along an evidence axis (partial vs. full solution text) and a distribution axis across a pool of candidates via a four-stage cascade plus optional rescue subroutine. The work derives a closed-form verifier-token cost formula and reports that, across four self-verifying models and five reasoning benchmarks (code and math), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code tasks and outperforms pointwise self-verification on all 20.
Significance. If the central claims hold, the work offers a practical advance in test-time scaling by reducing aggregation cost in parallel reasoning without performance loss. The closed-form cost derivation and the interpretable diagnostic based on partial-versus-full evidence accuracy are strengths that enable predictable budgeting and pre-deployment checks, potentially influencing efficient inference methods for reasoning models.
major comments (3)
- [§3.2] §3.2 (Cascade Design): The four-stage cascade with optional rescue relies on the assumption that the self-verifier maintains sufficient accuracy on truncated (partial) evidence to avoid irrecoverable false negatives during early pruning; the manuscript provides no per-stage accuracy curves or ablation isolating the effect of partial-evidence errors on final selection quality, which is load-bearing for the reported token reduction and 14/20 win rate.
- [§4.1] §4.1 (Experimental Results): The outperformance claims on 14 of 20 suites and the 25.4% token-budget figure are presented without error bars, multiple-run statistics, or an ablation on cascade depth, making it difficult to assess robustness of the efficiency gains relative to uniform full-evidence pairwise verification.
- [§3.3] §3.3 (Cost Formula): The closed-form verifier-token cost derivation claims the per-candidate marginal cost is roughly halved, but the paper does not include a direct table or verification comparing the formula's predictions against the empirically measured token counts from the reported experiments.
minor comments (2)
- [Abstract] Abstract: Clarify the precise criterion for 'outperforms' (accuracy, cost, or joint) in the 14/20 claim and list all five benchmarks explicitly rather than summarizing.
- [Figure 2] Figure 2 or 3 (Diagnostic Plots): The trade-off suites would be strengthened by overlaying partial-versus-full evidence accuracy to directly illustrate the cascade suitability check mentioned in the abstract.
Simulated Author's Rebuttal
Thank you for your detailed and constructive review of our manuscript. We appreciate the referee's recognition of the potential practical advance in test-time scaling and the strengths of the closed-form cost derivation. We address each major comment below and will incorporate the suggested additions and analyses in the revised version.
-
Referee: [§3.2] §3.2 (Cascade Design): The four-stage cascade with optional rescue relies on the assumption that the self-verifier maintains sufficient accuracy on truncated (partial) evidence to avoid irrecoverable false negatives during early pruning; the manuscript provides no per-stage accuracy curves or ablation isolating the effect of partial-evidence errors on final selection quality, which is load-bearing for the reported token reduction and 14/20 win rate.
Authors: We agree that explicit per-stage accuracy curves and a targeted ablation would strengthen the justification for the cascade design. The manuscript already provides an interpretable diagnostic based on partial-versus-full evidence accuracy as a pre-deployment check, but we will add per-stage verifier accuracy plots across models and benchmarks in the revision, together with an ablation that measures the impact of early-stage partial-evidence errors on final selection quality and overall token savings. revision: yes
-
Referee: [§4.1] §4.1 (Experimental Results): The outperformance claims on 14 of 20 suites and the 25.4% token-budget figure are presented without error bars, multiple-run statistics, or an ablation on cascade depth, making it difficult to assess robustness of the efficiency gains relative to uniform full-evidence pairwise verification.
Authors: We acknowledge that reporting variability and cascade-depth ablations would improve assessment of robustness. Due to compute limits the original experiments used single runs; in the revision we will rerun key suites with multiple random seeds and report means with standard deviations. We will also add an ablation varying cascade depth (2-, 3-, and 4-stage configurations) to quantify the contribution of each stage to the observed efficiency and accuracy gains. revision: yes
-
Referee: [§3.3] §3.3 (Cost Formula): The closed-form verifier-token cost derivation claims the per-candidate marginal cost is roughly halved, but the paper does not include a direct table or verification comparing the formula's predictions against the empirically measured token counts from the reported experiments.
Authors: We will add a dedicated verification table in the revised manuscript that directly compares the closed-form cost predictions against the empirically measured verifier token counts for each model-benchmark pair. This table will let readers check the predicted per-candidate marginal cost and the reported budget (approximately 25.4% of the leading pairwise verifier's verifier-token budget on code tasks) against measured counts, making the cost model fully transparent. revision: yes
Circularity Check
No circularity: derivation self-contained from cascade structure
The paper derives its closed-form verifier-token cost directly from the explicit four-stage cascade structure and optional rescue subroutine, without fitting parameters to data or renaming predictions. No self-definitional loops, fitted inputs called predictions, or load-bearing self-citations appear in the provided derivation chain. The method is framed as an empirical scheduling rule whose marginal-cost halving follows from the non-uniform evidence allocation by construction of the stages, and central performance claims rest on external benchmark results rather than tautological reductions. The verifier accuracy assumption at partial evidence is stated as a pre-deployment check, not smuggled into the equations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Verifier accuracy at partial evidence is high enough for early cascade stages to be reliable
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix excerpt: pipeline walkthrough
- DEDUP($\mathcal{C}$, $\phi_0$) produces $\mathcal{C}'$ with $N' \le 16$ and cluster sizes $\nu$.
- ELIMINATE($\mathcal{C}'$, $J$, 1, $\nu$, $T$) executes one halving round at E1: it invokes SLAUGHTER with key $\nu$, performs $\lfloor N'/2 \rfloor$ judge calls, and returns $\mathcal{C}_A$ with $|\mathcal{C}_A| = \lceil N'/2 \rceil$.
- ELIMINATE($\mathcal{C}_A$, $J$, 2, $S$, $T$, stop_at=4) halves at E2 until $f = 4$ remain: for $|\mathcal{C}_A| = 8$, this is one round of 4 judge calls.
- (Optional) RESCUE($\mathcal{C}_B$, $\mathcal{C}' \setminus \mathcal{C}_B$, 0.15) may expand $\mathcal{C}_B$ by one candidate.
- Stage C round-robin: $\binom{|\mathcal{C}_B|}{2}$ judge calls at E2, return $c^\star = \arg\max_c s_C(c)$ via Eq. (4).
Total judge calls in the deterministic pipeline (no rescue): 8 at E1 (Stage A) + 4 at E2 (Stage B) + 6 at E2 (Stage C) = 18 judge calls, decomposed as $8T_1 + 10T_2$ in token cost.
B.2 Implementation Details
B.2.1 Per-Call Token Cost Decomposition
The per-call token cost of a pai...
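The call counts in the walkthrough above can be reproduced with a short counting sketch. It assumes that each judge call eliminates exactly one candidate and that the last Stage B round is capped so that exactly f candidates survive; both are simplifications introduced here. The names `count_judge_calls` and `token_cost` are hypothetical, and the optional rescue step is ignored.

```python
from math import comb
from typing import Dict


def count_judge_calls(n_dedup: int, f: int = 4) -> Dict[str, int]:
    """Judge calls per stage for the no-rescue pipeline (illustrative only)."""
    # Stage A: one halving round at evidence level E1.
    stage_a = n_dedup // 2              # floor(N'/2) judge calls
    survivors = n_dedup - stage_a       # ceil(N'/2) candidates remain
    # Stage B: keep halving at E2 until f candidates remain.
    stage_b = 0
    while survivors > f:
        calls = min(survivors // 2, survivors - f)  # assumed cap at f survivors
        stage_b += calls
        survivors -= calls
    # Stage C: round-robin over the finalists at E2.
    stage_c = comb(survivors, 2)
    return {"A": stage_a, "B": stage_b, "C": stage_c}


def token_cost(counts: Dict[str, int], t1: float, t2: float) -> float:
    """Stage A calls cost T1 each (E1); Stage B and C calls cost T2 each (E2)."""
    return counts["A"] * t1 + (counts["B"] + counts["C"]) * t2


counts = count_judge_calls(16, f=4)
# Illustrative per-call token costs (made up, not measurements from the paper).
total = token_cost(counts, t1=1000, t2=3000)
```

For N' = 16 and f = 4 the sketch yields 8, 4, and 6 calls for Stages A, B, and C, matching the 8 T1 + 10 T2 decomposition quoted above.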
Appendix excerpt: cost summary
- Asymptotic: $T_{\mathrm{CAPS}}(N', f) = \frac{N'}{2}(T_1 + T_2) - f\,T_2 + \binom{f}{2}\,T_2 + O(\log N')$, with marginal cost per candidate $\frac{1}{2}(T_1 + T_2)$.
- With CAPS-R: $\mathbb{E}[T_{\mathrm{CAPS+R}}] = T_{\mathrm{CAPS}} + p_R\,f\,T_2$, with $p_R \in [0.10, 0.15]$ adding less than 6% overhead.
None of these depends on tunable hyperparameters of CAPS that affect cost: the cost is determined entirely by $N$, $f$, $\rho$, and the empirical trigger rate $p_R$, all of which are properties of the deployment.
D Experimental Details and Extended Results
This appendix suppl...
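The cost expressions in the appendix excerpt above can be evaluated directly. The following is a sketch under stated assumptions: it keeps only the leading terms and drops the O(log N') remainder, the function names are hypothetical, and the per-call token costs are made-up values rather than measurements from the paper.

```python
from math import comb


def caps_cost(n_dedup: int, f: int, t1: float, t2: float) -> float:
    """Leading terms of T_CAPS(N', f); the O(log N') remainder is dropped."""
    return (n_dedup / 2) * (t1 + t2) - f * t2 + comb(f, 2) * t2


def caps_r_expected_cost(n_dedup: int, f: int, t1: float, t2: float,
                         p_rescue: float) -> float:
    """Expected cost with the rescue subroutine: T_CAPS + p_R * f * T2."""
    return caps_cost(n_dedup, f, t1, t2) + p_rescue * f * t2


# Illustrative per-call token costs (assumed, not taken from the paper).
T1, T2 = 1000.0, 3000.0
base = caps_cost(16, 4, T1, T2)          # same as 8*T1 + 10*T2 when N' = 16, f = 4
with_rescue = caps_r_expected_cost(16, 4, T1, T2, p_rescue=0.15)
# Cost of two extra candidates, divided by two, gives the per-candidate marginal cost.
marginal = (caps_cost(18, 4, T1, T2) - base) / 2
```

With these assumed values the rescue term adds 1800 tokens on a base of 38000, about 4.7%, which is in line with the "less than 6%" overhead quoted above; other ratios of T1 to T2 would give different percentages.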