DeepPrune: Parallel Scaling without Inter-trace Redundancy

Juanzi Li; Lei Hou; Shangqing Tu; Yaxuan Li; Yushi Bai

arxiv: 2510.08483 · v2 · submitted 2025-10-09 · 💻 cs.CL · cs.AI

Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li

Pith reviewed 2026-05-18 08:34 UTC · model grok-4.3

keywords: parallel scaling · Chain-of-Thought · token efficiency · redundancy pruning · judge model · greedy clustering · LLM reasoning · consensus sampling

The pith

DeepPrune prunes redundant parallel reasoning traces early to reduce token consumption by 66 to 88 percent with accuracy nearly unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to eliminate wasted computation in parallel scaling of large language models, where over 80 percent of multiple Chain-of-Thought traces converge to the same answer. It does so by training a judge model on older problems to predict from incomplete traces whether two paths will agree, then applying greedy clustering to drop the duplicates on the fly. A reader would care because this inefficiency currently limits how many traces can be run in practice for better reasoning. The result is a framework that delivers large efficiency gains on math and science benchmarks while staying close to full sampling performance.

Core claim

DeepPrune addresses inter-trace redundancy in parallel scaling with two components. A judge model, trained with out-of-distribution data and oversampling, predicts answer equivalence from partial reasoning traces and achieves 0.7072 AUROC on unseen models. An online greedy clustering algorithm then uses those predictions to prune redundant paths dynamically while preserving answer diversity. Together they cut token use by 65.73 percent to 88.50 percent on AIME 2024, AIME 2025, and GPQA compared to conventional consensus sampling, with accuracy within 3 percentage points.

What carries the argument

Judge model for predicting answer equivalence from partial traces paired with online greedy clustering algorithm for dynamic pruning of redundant reasoning paths.
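The pairing of a judge with online greedy clustering can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: `toy_judge` is a hypothetical word-overlap stand-in for the trained judge, the 0.5 threshold is an arbitrary choice, and comparing each new trace only against the first member of every cluster is an assumption, since the page does not spell out the assignment rule.

```python
from typing import Callable, List


def greedy_cluster(
    traces: List[str],
    judge: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[List[int]]:
    """Assign each partial trace to the first cluster whose representative
    the judge scores at or above `threshold`; otherwise open a new cluster."""
    clusters: List[List[int]] = []
    for i, trace in enumerate(traces):
        for cluster in clusters:
            representative = traces[cluster[0]]  # assumption: first member represents the cluster
            if judge(representative, trace) >= threshold:
                cluster.append(i)  # predicted to reach the same answer: redundant
                break
        else:
            clusters.append([i])  # no cluster matched: treat as a novel path
    return clusters


def toy_judge(a: str, b: str) -> float:
    """Hypothetical stand-in for the trained judge: Jaccard word overlap."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0


partial_traces = [
    "let x be the root so x squared equals 9",
    "let x be the root then x squared equals 9",
    "count the lattice points inside the circle",
]
clusters = greedy_cluster(partial_traces, toy_judge, threshold=0.5)
survivors = [cluster[0] for cluster in clusters]  # one trace per cluster keeps generating
print(clusters, survivors)
```

In this toy run the first two traces share a cluster and the third opens its own, so only two of the three traces would continue to be generated.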

If this is right

Substantial reduction in the number of tokens required for parallel reasoning compared to generating all traces fully.
Competitive performance maintained across challenging benchmarks like AIME and GPQA.
Ability to scale parallel methods to more traces or larger models without proportional compute increase.
The pruning works across different reasoning models without needing to retrain the judge for each one.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar early-pruning techniques could apply to other sampling-based methods in language model inference beyond reasoning tasks.
Testing the judge on a wider range of model families would reveal how general the equivalence prediction is.
Integrating this with an adaptive number of traces based on detected diversity could further optimize resource use.
Resource-limited deployments of advanced reasoning might become feasible through these savings.

Load-bearing premise

The judge model trained on specific out-of-distribution math datasets can accurately predict final answer equivalence from partial traces generated by different reasoning models on new benchmarks.

What would settle it

Applying DeepPrune to reasoning traces from a model and benchmark completely outside the training distribution and checking whether the accuracy stays within 3 percentage points of full sampling or if many correct answers are pruned away.

Figures

Figures reproduced from arXiv: 2510.08483 by Juanzi Li, Lei Hou, Shangqing Tu, Yaxuan Li, Yushi Bai.

**Figure 1.** DeepPrune conducts early stopping based on the similarity between reasoning traces to enhance the efficiency of parallel scaling and save diverse traces. This advancement is driven by inference-time scaling (Jaech et al., 2024), a new paradigm that enhances LLM’s reasoning capabilities via more computing in the test stage (Snell et al., 2025). Generally, there are two types of inference-time scaling: s…

**Figure 2.** Analysis of Inter-trace Redundancy. (a) Distribution of same vs. different answer pairs of reasoning traces, …

**Figure 3.** Overview of the DeepPrune framework. The offline training phase (top) involves constructing trace pair datasets with binary labels indicating answer equivalence, then training a judge model using focal loss and oversampling to address class imbalance. The online pruning phase (bottom) leverages the trained judge model to perform dynamic pruning via greedy clustering where traces are assigned to existing cl…

**Figure 4.** Ablation study on judge model with different truncation strategies for unfinished reasoning traces. We …
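The Figure 3 caption names focal loss and oversampling as the two remedies for class imbalance in the judge's training pairs. The sketch below shows what each does in plain Python, assuming the standard binary focal loss form; the `gamma` and `alpha` defaults, the random-duplication oversampler, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (trace-pair text, label); 1 = same final answer


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one example, where p is the predicted probability of label 1.
    The (1 - p_t) ** gamma factor shrinks the loss on easy, well-classified pairs."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def oversample_minority(pairs: List[Example], seed: int = 0) -> List[Example]:
    """Randomly duplicate minority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [ex for ex in pairs if ex[1] == 1]
    neg = [ex for ex in pairs if ex[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(pairs)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# Toy data mirroring the reported skew: most pairs share an answer.
toy_pairs = [("same-answer pair", 1)] * 8 + [("different-answer pair", 0)] * 2
balanced = oversample_minority(toy_pairs)
print(len(balanced), binary_focal_loss(0.9, 1), binary_focal_loss(0.1, 1))
```

The confident correct prediction (`p = 0.9`) contributes far less loss than the confident wrong one (`p = 0.1`), which is the property that makes focal loss useful when one class dominates.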

Original abstract

Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072 AUROC on unseen reasoning models. Combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves remarkable token reduction of 65.73%--88.50% compared to conventional consensus sampling, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: https://deepprune.github.io/.
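The quantities the abstract reports can be made concrete with a small sketch: the share of trace pairs that end in the same answer, the majority-vote aggregation used by consensus sampling, and a token reduction percentage. The function names and the toy numbers are assumptions for illustration; the paper's exact measurement protocol is not given on this page.

```python
from collections import Counter
from itertools import combinations
from typing import List


def same_answer_pair_rate(answers: List[str]) -> float:
    """Fraction of unordered trace pairs whose final answers are identical."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def majority_vote(answers: List[str]) -> str:
    """Consensus answer: the most frequent final answer across traces."""
    return Counter(answers).most_common(1)[0][0]


def token_reduction(baseline_tokens: int, pruned_tokens: int) -> float:
    """Percentage of tokens saved relative to generating every trace in full."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (1 - pruned_tokens / baseline_tokens)


# Toy example: 10 traces, 9 of which agree.
answers = ["204"] * 9 + ["210"]
print(same_answer_pair_rate(answers))  # 36 of 45 pairs agree
print(majority_vote(answers))
print(token_reduction(1000, 250))
```

With nine of ten traces agreeing, 36 of the 45 pairs match, which is the kind of redundancy the pruning step is meant to remove before those traces are fully generated.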

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DeepPrune's reported token cuts depend on a judge that only hits 0.7072 AUROC on unseen models, which looks like the main risk to both efficiency and accuracy claims.

The main issue with this paper is that the large token reductions come from a judge whose AUROC is only 0.7072 on unseen models. That leaves real room for misclassifications that could eat into both the savings and the accuracy claims.

They train the judge on out-of-distribution contest data with oversampling to predict whether partial traces lead to the same answer. An online greedy clustering step then prunes the redundant ones while keeping diversity. On AIME 2024, AIME 2025, and GPQA they report 65-88% fewer tokens than standard consensus sampling, with accuracy drops under 3 points. The public code link is a plus.

What is new is how they combine the partial-trace judge with dynamic pruning specifically for parallel scaling. Prior work has pruning and early exits, but this targets the inter-trace redundancy they measured at over 80%.

The weak part is the judge performance. A 0.7072 AUROC suggests non-trivial false positives and negatives, and without precision-recall curves or propagation studies it's hard to know if the accuracy stays competitive when the judge is wrong. The abstract lacks ablations and error bars, so the central efficiency story needs checking in the full paper.

This work is for people doing LLM inference optimization and test-time compute. A reader looking for practical ways to scale parallel reasoning would find the benchmarks and numbers useful. It deserves a serious referee because the idea addresses a clear bottleneck with measurable results. I would send it out for review, but flag the need for stronger validation of the judge's reliability.

Referee Report

2 major / 2 minor

Summary. The paper proposes DeepPrune, a framework for efficient parallel scaling of LLM reasoning via dynamic pruning of redundant CoT traces. It trains a judge model on out-of-distribution data (AIME 2022/2023 and MATH 500) with oversampling to predict answer equivalence from partial traces (0.7072 AUROC on unseen models), combines this with online greedy clustering to prune while preserving diversity, and reports 65.73%–88.50% token reduction versus conventional consensus sampling with accuracy within 3 percentage points on AIME 2024, AIME 2025, and GPQA across multiple reasoning models. Code and data are released.

Significance. If the efficiency and accuracy claims hold under the judge’s actual error distribution, the work would meaningfully reduce wasted computation in parallel reasoning without sacrificing performance, addressing a documented redundancy issue (>80% identical answers). The release of code and data strengthens reproducibility and enables direct verification of the reported token savings and benchmark results.

major comments (2)

Abstract and evaluation sections: The headline claim of 65.73%–88.50% token reduction while staying within 3 pp accuracy depends on the judge correctly classifying answer equivalence from partial traces. The reported 0.7072 AUROC on unseen models implies non-negligible false-positive and false-negative rates, yet no precision-recall curves, calibration analysis, or ablation measuring how judge misclassifications propagate into final benchmark accuracy or token counts are provided. This leaves the robustness of the simultaneous efficiency and accuracy claims unverified.
Method description (judge training and clustering): The equivalence-prediction threshold and clustering similarity cutoff are listed as free parameters in the experimental setup. Without a sensitivity analysis or explicit statement of how these thresholds were chosen on held-out data (distinct from the AIME 2024/2025 and GPQA test sets), it is difficult to assess whether the reported gains are stable or partly the result of post-hoc tuning.

minor comments (2)

Abstract: The phrase 'accurately predict' is inconsistent with the moderate 0.7072 AUROC; a more precise qualifier such as 'with moderate discriminative power' would better reflect the quantitative result.
Figure or table presenting token-reduction and accuracy numbers: Error bars or standard deviations across runs or models are not mentioned; adding them would clarify the stability of the within-3-pp accuracy claim.
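For readers weighing the referee's concern about the 0.7072 AUROC, the sketch below shows what the metric measures: the probability that a randomly chosen same-answer pair receives a higher judge score than a randomly chosen different-answer pair. It also computes precision and recall at one threshold, the quantities a precision-recall curve would trace. The scores and labels are made-up toy data, and the function names are illustrative.

```python
from typing import List, Tuple


def auroc(scores: List[float], labels: List[int]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Precision and recall when pairs scoring >= threshold are declared equivalent."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy judge scores; label 1 means the two traces end in the same answer.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3]
toy_labels = [1, 1, 1, 0, 0]
print(auroc(toy_scores, toy_labels))
print(precision_recall(toy_scores, toy_labels, threshold=0.5))
```

A false positive here means a trace that would have reached a different answer is pruned, which is the error mode that threatens answer diversity; a false negative only costs extra tokens.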

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and have revised the manuscript accordingly to improve the robustness analysis.

Referee: Abstract and evaluation sections: The headline claim of 65.73%–88.50% token reduction while staying within 3 pp accuracy depends on the judge correctly classifying answer equivalence from partial traces. The reported 0.7072 AUROC on unseen models implies non-negligible false-positive and false-negative rates, yet no precision-recall curves, calibration analysis, or ablation measuring how judge misclassifications propagate into final benchmark accuracy or token counts are provided. This leaves the robustness of the simultaneous efficiency and accuracy claims unverified.

Authors: We agree that the AUROC of 0.7072 indicates the presence of classification errors and that their propagation into final metrics should be quantified. In the revised manuscript we have added precision-recall curves and a calibration analysis for the judge model (new Figure 4). We have also included an ablation that perturbs the judge outputs according to the observed error rates and re-measures end-to-end accuracy and token savings on the test benchmarks; the results confirm that accuracy stays within the reported 3 pp margin and token reductions remain above 60%. revision: yes
Referee: Method description (judge training and clustering): The equivalence-prediction threshold and clustering similarity cutoff are listed as free parameters in the experimental setup. Without a sensitivity analysis or explicit statement of how these thresholds were chosen on held-out data (distinct from the AIME 2024/2025 and GPQA test sets), it is difficult to assess whether the reported gains are stable or partly the result of post-hoc tuning.

Authors: The thresholds were selected on a held-out validation split of the training data (AIME 2022/2023 and MATH 500) by maximizing a joint objective of equivalence-prediction F1 and downstream accuracy. We have now added an explicit statement of this procedure in Section 3.3 and a sensitivity analysis in the new Appendix C that sweeps the equivalence threshold from 0.4 to 0.6 and the clustering cutoff from 0.7 to 0.9. Across this range, accuracy varies by less than 4 pp and token savings by less than 6%, indicating stability. revision: yes
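The threshold selection described in the second response can be illustrated with a simplified sweep. This sketch is an assumption-laden stand-in: it scores each candidate equivalence threshold by F1 alone on a toy validation set, whereas the response describes a joint objective with downstream accuracy, and it does not model the clustering cutoff. The data and names are hypothetical.

```python
from typing import Dict, List


def f1_at(scores: List[float], labels: List[int], threshold: float) -> float:
    """F1 of the 'equivalent' class when pairs scoring >= threshold are merged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(scores: List[float], labels: List[int], grid: List[float]) -> Dict[float, float]:
    """Evaluate every candidate threshold on held-out validation pairs."""
    return {t: f1_at(scores, labels, t) for t in grid}


# Hypothetical validation scores and labels (1 = same final answer).
val_scores = [0.95, 0.62, 0.58, 0.52, 0.47, 0.44, 0.41, 0.30]
val_labels = [1, 1, 1, 0, 1, 0, 0, 0]
grid = [0.40, 0.45, 0.50, 0.55, 0.60]  # the range named in the response

results = sweep_thresholds(val_scores, val_labels, grid)
best_threshold = max(results, key=results.get)
for t in grid:
    print(f"threshold={t:.2f}  F1={results[t]:.3f}")
print("selected:", best_threshold)
```

Printing the whole grid rather than only the winner is the point of a sensitivity analysis: a flat F1 curve across the range suggests the choice is stable, while a sharp peak suggests the reported gains may depend on tuning.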

Circularity Check

0 steps flagged

No significant circularity detected; empirical method relies on held-out evaluation

The paper describes an empirical pipeline: analysis of >80% inter-trace answer duplication, followed by training a judge model on AIME 2022/2023 + MATH 500 (with oversampling) to predict equivalence from partial traces, reporting 0.7072 AUROC on unseen models, then applying an online greedy clustering pruner to achieve 65.73–88.50% token reduction on AIME 2024/2025 and GPQA while staying within 3 pp accuracy. No equations, self-definitional loops, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The reported gains are measured on separate benchmarks and models rather than reducing to the training inputs by construction, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method rests on the generalization ability of the trained judge and the assumption that early pruning decisions based on partial traces do not systematically eliminate correct but initially divergent answers.

free parameters (2)

equivalence prediction threshold
Decision boundary used by the judge to declare two partial traces equivalent; value is not stated in abstract but must be chosen or tuned.
clustering similarity cutoff
Parameter controlling when the greedy algorithm merges or prunes traces; affects the diversity-accuracy trade-off.

axioms (1)

Domain assumption: Partial reasoning traces contain sufficient signal to predict final-answer equivalence before completion.
Core premise enabling early pruning; invoked when the judge operates on incomplete traces.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL · 2026-05 · conditional · novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL · 2026-05 · unverdicted · novelty 7.0

AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
cs.CL · 2026-04 · unverdicted · novelty 5.0

STOP is a new learnable internal path-pruning technique that improves efficiency and accuracy of parallel reasoning in LRMs under fixed compute budgets.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Pranjal Aggarwal and Sean Welleck. 2025. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697. AIME. 2025. Aime problems and solutions. Anthropic. 2024. Anthropic: Introducing claude 3.5 sonnet. Daman Arora and Andrea Zanette. 20...
[2]

arXiv preprint arXiv:2503.05179

Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. arXiv preprint arXiv:2503.05179. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787. Lingjiao Che...
[3]

ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296. Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al
[4]

OpenAI o1 System Card

Openai o1 system card. arXiv preprint arXiv:2412.16720. Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou
[5]

Adaptive group policy optimization: Towards stable training and token-efficient reasoning

C3ot: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language mo...
[6]

Efficient inference for large reasoning models: A survey,

Can language models learn to skip steps? Advances in Neural Information Processing Systems, 37:45359–45385. Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, et al. 2025. Efficient inference for large reasoning models: A survey. arXiv preprint arXiv:2503.23077. Michael Luo, Sijun Tan, Jus...
[7]

https://github.com/rllm-org/rllm

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://github.com/rllm-org/rllm. GitHub. MAA. 2024. American invitational mathematics examination - aime. Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal
[8]

Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123, 2025

Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123. Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. 2025. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint a...
[9]

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al

Towards reasoning ability of small language models. arXiv preprint arXiv:2502.11569. Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. 2025. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419. Kimi Team, Angang D...
[10]

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li

Siri: Scaling iterative reinforcement learning with interleaved compression. arXiv preprint arXiv:2509.25176. Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. 2025. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067. Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. 2025. Chain of draft: Th...
[11]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822. Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv...
[12]

It helps mitigate errors from individual traces and leverages the collective intelligence of diverse reasoning paths

Effectiveness and Common Practice: Majority voting is a widely adopted and empirically effective method for aggregating multiple reasoning traces to derive a robust final answer, as demonstrated by pioneering works like Self-Consistency (Wang et al., 2022). It helps mitigate errors from individual traces and leverages the collective intelligence of diverse...
[13]

Fair Comparison with Baselines: To enable a direct and fair comparison of final answer accuracy with methods like DeepConf that also produce a single aggregated answer, we needed a mechanism to consolidate the diverse traces retained by DeepPrune into one final prediction. While our method inherently preserves inter-trace diversity for potential pass@k eva...
