Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
Pith reviewed 2026-05-09 16:26 UTC · model grok-4.3
The pith
Collaborative step-wise decoding lets multiple large reasoning models build higher-quality Long-CoT traces for distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoRD is a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses, yielding higher-quality reasoning data that supports near teacher-level student performance with fewer supervision signals and without substantial efficiency overhead.
What carries the argument
CoRD, the collaborative multi-teacher decoding framework that uses predictive perplexity scoring of partial trajectories together with beam search to let heterogeneous models synthesize coherent reasoning paths step by step.
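Read as an algorithm, the step-wise collaboration can be sketched in a few lines. Everything below is a toy illustration under an assumed teacher interface (`teacher(traj)` returning a proposed step plus its token log-probabilities), not the paper's released implementation:

```python
import math

def perplexity(logprobs):
    """Perplexity of a partial trajectory from its token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def collaborative_beam_search(teachers, prefix, beam_width=2, max_steps=3):
    """Toy step-wise multi-teacher decoding: at each step every teacher
    proposes one continuation for every beam entry, and the beam keeps
    the candidates with the lowest predictive perplexity."""
    beam = [prefix]
    for _ in range(max_steps):
        candidates = []
        for traj in beam:
            for teacher in teachers:
                step, logprobs = teacher(traj)  # hypothetical interface
                candidates.append((traj + [step], logprobs))
        candidates.sort(key=lambda c: perplexity(c[1]))  # lower is better
        beam = [traj for traj, _ in candidates[:beam_width]]
    return beam

# Two stand-in "teachers": one confidently fluent, one less so.
def teacher_a(traj):
    return f"a{len(traj)}", [-0.1] * (len(traj) + 1)

def teacher_b(traj):
    return f"b{len(traj)}", [-0.5] * (len(traj) + 1)

best = collaborative_beam_search([teacher_a, teacher_b], [], beam_width=1, max_steps=2)
```

With a beam width of 1 the loop degenerates to greedy selection of the most fluent teacher at every step, which is exactly the failure mode the referee probes below: nothing in the score itself checks logical soundness.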
If this is right
- Students trained on CoRD data reach performance close to the full teacher models.
- Fewer structured examples are needed to achieve that performance level.
- Data generation adds no substantial computational overhead beyond standard decoding.
- The quality advantage transfers to out-of-domain and open-ended tasks.
- Diverse reasoning styles are retained while coherence is enforced at each step.
Where Pith is reading between the lines
- The same step-wise collaboration could be adapted for live ensemble inference rather than only offline data creation.
- Smaller or differently trained teachers might still contribute useful partial steps under the same scoring rule.
- Adding secondary filters such as logical consistency checks could reduce any remaining selection bias.
- Public release of the generated traces lets others test scaling to larger sets of teachers or new task families.
Load-bearing premise
Predictive perplexity scoring combined with beam search will surface coherent, high-potential reasoning trajectories from different teachers without systematic bias or loss of complementary information.
What would settle it
If a student model trained on CoRD data shows no meaningful gain over students trained on single-teacher generations or randomly selected traces on the same reasoning benchmarks, the claimed advantage would not hold.
Original abstract
Distilling large reasoning models is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at \href{https://github.com/DISL-Lab/CoRD}{https://github.com/DISL-Lab/CoRD}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoRD, a collaborative step-wise multi-teacher decoding framework for distilling long-CoT reasoning from heterogeneous large reasoning models (LRMs). It performs dynamic synthesis of reasoning trajectories using predictive perplexity-based scoring combined with beam search, allowing teachers to jointly construct coherent paths while preserving diverse hypotheses. The central claims are that this produces higher-quality reasoning data than post-hoc curation, enables student models to reach near teacher-level performance using fewer structured supervision signals, incurs no substantial efficiency overhead, and generalizes to out-of-domain and open-ended tasks. The dataset and models are released publicly.
Significance. If the results hold under rigorous validation, the work could meaningfully advance efficient distillation of long-chain reasoning capabilities. By shifting from static post-hoc selection to collaborative, step-wise synthesis across heterogeneous teachers, it addresses redundancy in sampling and loss of complementary information. The public release of data and code supports reproducibility and follow-up work in the area of scalable reasoning model training.
Major comments (2)
- [§3] §3 (CoRD framework description): The method's core mechanism—ranking trajectories by predictive perplexity and selecting via beam search—assumes lower perplexity reliably indicates higher-quality, logically sound reasoning. However, perplexity primarily reflects token-level fluency and predictability rather than logical correctness or completeness. No ablation or correlation analysis is provided to test whether this scoring systematically discards correct but higher-perplexity paths or favors teacher-specific styles, which directly undermines the claims of higher-quality data and near-teacher student performance.
- [§5] §5 (Experiments): The reported outcomes lack sufficient quantitative detail on exact metrics (e.g., accuracy, pass@1), baselines (including single-teacher and post-hoc methods), statistical significance, variance across runs, or controls for teacher similarity. Without these, it is impossible to verify the magnitude of gains, rule out post-hoc selection effects, or confirm generalization claims to OOD and open-ended settings.
Minor comments (2)
- The abstract would be strengthened by including one or two key quantitative results (e.g., student accuracy relative to teachers) to make the performance claims concrete.
- Notation for predictive perplexity scoring and beam search parameters should be defined more explicitly with equations to improve clarity of the algorithmic procedure.
Simulated Author's Rebuttal
We are grateful to the referee for their constructive comments, which have helped us improve the clarity and rigor of our work. We respond to each major comment below and indicate the revisions made.
Point-by-point responses
- Referee: [§3] The method's core mechanism—ranking trajectories by predictive perplexity and selecting via beam search—assumes lower perplexity reliably indicates higher-quality, logically sound reasoning. However, perplexity primarily reflects token-level fluency and predictability rather than logical correctness or completeness. No ablation or correlation analysis is provided to test whether this scoring systematically discards correct but higher-perplexity paths or favors teacher-specific styles, which directly undermines the claims of higher-quality data and near-teacher student performance.
Authors: We acknowledge that perplexity serves primarily as a proxy for token-level fluency and predictability rather than direct logical correctness. In the CoRD framework, it is employed step-wise within the collaborative beam search to enable heterogeneous teachers to jointly explore and extend promising reasoning prefixes while preserving diversity. To directly address the concern, the revised manuscript includes a new ablation study comparing perplexity-guided selection against random selection and alternative heuristics, along with an analysis of correlation between selected trajectories' perplexity and downstream task accuracy (as a proxy for logical soundness). These additions support the efficacy of the approach while explicitly noting the limitations of perplexity as a proxy. revision: yes
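The correlation analysis the authors promise could, for instance, take the form of a rank correlation between each selected trajectory's perplexity and its downstream accuracy. The sketch below hand-rolls a Spearman coefficient (ties not handled) with invented numbers, purely for illustration; it is not the authors' analysis code:

```python
def _ranks(xs):
    """Rank positions of xs (ties not handled in this toy version)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# If low perplexity really tracked correctness, perplexity and accuracy
# would be strongly negatively rank-correlated (hypothetical values):
ppl = [1.2, 1.5, 2.1, 3.0]   # trajectory perplexities
acc = [0.9, 0.8, 0.6, 0.3]   # downstream accuracies
rho = spearman(ppl, acc)
```

A rho near -1 would support perplexity as a proxy; a rho near 0 would confirm the referee's concern that fluency and correctness come apart.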
- Referee: [§5] The reported outcomes lack sufficient quantitative detail on exact metrics (e.g., accuracy, pass@1), baselines (including single-teacher and post-hoc methods), statistical significance, variance across runs, or controls for teacher similarity. Without these, it is impossible to verify the magnitude of gains, rule out post-hoc selection effects, or confirm generalization claims to OOD and open-ended settings.
Authors: We agree that greater quantitative rigor is needed for verifiability. The revised Section 5 now provides expanded tables with exact accuracy and pass@1 values for CoRD versus all baselines, including single-teacher decoding and post-hoc curation methods. Results are reported as means with standard deviations over five independent runs, accompanied by paired t-test p-values for statistical significance. We further include controls for teacher similarity by varying model families and architectures, and supply specific quantitative metrics for the OOD and open-ended task evaluations to substantiate the generalization claims. revision: yes
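The significance reporting described here reduces to a paired t statistic over matched runs. A minimal sketch (statistic only, no p-value lookup, with invented per-seed accuracies) might look like:

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic for matched runs of two methods:
    mean of per-run differences over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Hypothetical per-seed accuracies over five matched runs.
cord_runs = [0.81, 0.83, 0.80, 0.82, 0.84]
base_runs = [0.78, 0.80, 0.77, 0.79, 0.80]
t_stat = paired_t(cord_runs, base_runs)
```

Pairing by seed matters with only five runs: it removes between-seed variance that an unpaired comparison would wrongly count against the effect.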
Circularity Check
No circularity in CoRD's procedural framework
Full rationale
The paper introduces CoRD as an empirical procedural framework for step-wise multi-teacher decoding guided by perplexity scoring and beam search, with performance claims supported solely by experimental results on distillation quality and generalization. No equations, derivations, or self-referential definitions appear that would reduce reported gains (e.g., near-teacher student performance) to quantities defined by the method's own inputs or fitted parameters. There are no load-bearing self-citations, uniqueness theorems, or ansatzes that collapse the central claims by construction. This matches the expectation for a non-mathematical method paper where the derivation chain is absent and thus cannot be circular.