pith. machine review for the scientific record.

arxiv: 2502.21074 · v3 · pith:OLBY5FDUnew · submitted 2025-02-28 · 💻 cs.CL

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

Pith reviewed 2026-05-17 23:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords chain-of-thought · implicit CoT · self-distillation · continuous space · compression · GSM8K · language models · latent reasoning

The pith

Self-distillation aligns one token's hidden state to transfer chain-of-thought reasoning into continuous space without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to compress natural-language chain-of-thought steps into a model's continuous hidden representations. It jointly trains an explicit CoT teacher and an implicit CoT student, then distills the reasoning ability by forcing the student to match the hidden state of one designated token produced by the teacher. A reader would care because this removes the need to generate long sequences of text for each reasoning step, promising shorter outputs and possibly more robust behavior. At GPT-2 scale the resulting model is reported as the first implicit method to match explicit CoT accuracy on GSM8K, with a 3.1x compression rate and a 28.2% accuracy improvement over the previous state-of-the-art implicit method. The work also reports that the continuous-space version generalizes to harder datasets and offers some interpretability of the internal reasoning trace.

Core claim

CODI jointly trains a teacher on explicit natural-language chain-of-thought and a student on implicit continuous-space reasoning, then distills the teacher's reasoning capability into the student by aligning the hidden state of a single designated token. This alignment transfers the multi-step reasoning process into latent space, allowing the student to match the teacher's accuracy on GSM8K at a 3.1x compression rate.

What carries the argument

Self-distillation via alignment of the hidden state of one designated token between the explicit CoT teacher and the implicit CoT student.
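
The alignment idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the page does not state the distance function, so a mean absolute (L1) difference averaged over layers is used, and the names mean_abs_diff, alignment_loss, joint_loss, and the weight w_align are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def mean_abs_diff(a: Vector, b: Vector) -> float:
    """Mean absolute (L1) difference between two equal-length vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal dimension")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def alignment_loss(teacher_layers: Sequence[Vector],
                   student_layers: Sequence[Vector]) -> float:
    """Average the per-layer distance at the single designated token.

    Each argument holds one hidden-state vector per layer, all read at the
    designated token position. In real training the teacher vectors would
    be treated as constants (no gradient); plain Python has no gradients,
    so that detail is only noted here.
    """
    if len(teacher_layers) != len(student_layers) or not teacher_layers:
        raise ValueError("need the same non-zero number of layers on both sides")
    per_layer = [mean_abs_diff(t, s)
                 for t, s in zip(teacher_layers, student_layers)]
    return sum(per_layer) / len(per_layer)


def joint_loss(teacher_ce: float, student_ce: float,
               align: float, w_align: float = 1.0) -> float:
    """Hypothetical joint objective: both task losses plus weighted alignment."""
    return teacher_ce + student_ce + w_align * align


# Two layers, hidden size 2 (toy numbers, not from the paper).
teacher = [[1.0, 2.0], [0.0, 0.0]]
student = [[1.0, 1.0], [1.0, 1.0]]
print(alignment_loss(teacher, student))  # 0.75
```

The sketch only shows why a single position yields one vector per layer to match; whether that target is sufficient is exactly what the premise below questions.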

Load-bearing premise

Aligning the hidden states of a single designated token is enough to transfer the full reasoning capability from language to continuous space without loss or distortion.

What would settle it

Train the CODI student on GSM8K with the stated alignment loss and check whether its final accuracy remains within a few points of the explicit teacher's accuracy.
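
A rough sketch of that check, assuming exact-match scoring of final answers and a tolerance expressed in accuracy points. The function names, the default tolerance of 2.0 points, and the toy answers are hypothetical, chosen only to make "within a few points" concrete.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Percentage of predictions that equal the reference after stripping."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty answer lists")
    correct = sum(p.strip() == r.strip()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def within_parity(student_acc: float, teacher_acc: float,
                  tolerance_points: float = 2.0) -> bool:
    """True if the student trails the teacher by at most the tolerance."""
    return teacher_acc - student_acc <= tolerance_points


# Toy answers, not results from the paper.
gold = ["18", "3", "8", "5"]
student_answers = ["18", "3", "7", "5"]
print(exact_match_accuracy(student_answers, gold))  # 75.0
```

A one-sided comparison is used so that a student that exceeds the teacher still counts as reaching parity.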

Original abstract

Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizability to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CODI, a self-distillation framework that jointly trains an explicit CoT teacher and an implicit CoT student model. Reasoning is compressed into continuous space by aligning the hidden states of a single designated token between teacher and student, bypassing explicit language steps. On GSM8K at GPT-2 scale, the method claims to match explicit CoT accuracy for the first time among implicit approaches, with a 3.1x compression rate and 28.2% accuracy gain over prior implicit CoT SOTA; additional claims include robustness and interpretability.

Significance. If the central empirical result holds, this would be a notable contribution to implicit reasoning in LLMs by demonstrating that latent continuous space can achieve parity with explicit language-based CoT without sacrificing accuracy. The self-distillation approach, reported compression factor, and open-sourced code (https://github.com/zhenyi4/codi) are strengths that support reproducibility and further exploration of continuous-space reasoning.

major comments (2)
  1. Method section: The distillation relies on aligning the hidden state of only one designated token between the explicit teacher and implicit student. Because explicit CoT performs autoregressive multi-step reasoning where each token's representation conditions on prior steps, a single-vector target risks encoding only a summary embedding rather than the full reasoning trajectory; performance parity could then arise from direct question-to-answer mapping instead of genuine latent reasoning. Ablations comparing single-token vs. multi-token or trajectory alignment are needed to substantiate the claim.
  2. Experiments section: The reported accuracy gains and SOTA outperformance lack error bars, variance across runs, or detailed baseline comparisons and ablations on the designated token choice. These omissions make it difficult to evaluate whether the 28.2% improvement and explicit-CoT parity are robust or sensitive to implementation details.
minor comments (2)
  1. Abstract and results: Claims of 'robustness' and 'interpretability' are stated without accompanying quantitative metrics or analysis in the provided summary; these should be supported by specific numbers or figures.
  2. Notation: The description of the 'designated token' is somewhat vague; clarifying its identity (e.g., final token, special token) and how its hidden state is extracted would improve clarity.
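
The alignment targets contrasted in major comment 1 can be made concrete with a small sketch. It only selects which teacher positions would supply hidden-state targets; it takes the final position as the designated token, which is an assumption (the referee notes the token's identity is unclear). The function name, mode strings, and the step_ends argument are hypothetical.

```python
from typing import List, Optional, Sequence


def alignment_positions(seq_len: int, mode: str, k: int = 1,
                        step_ends: Optional[Sequence[int]] = None) -> List[int]:
    """Return the teacher token positions used as alignment targets.

    mode="single":     only the final position (assumed designated token)
    mode="last_k":     the last k positions
    mode="trajectory": the last position of every reasoning step,
                       supplied by the caller in step_ends
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if mode == "single":
        return [seq_len - 1]
    if mode == "last_k":
        if not 1 <= k <= seq_len:
            raise ValueError("k must be between 1 and seq_len")
        return list(range(seq_len - k, seq_len))
    if mode == "trajectory":
        if not step_ends:
            raise ValueError("trajectory mode needs step_ends")
        if any(not 0 <= i < seq_len for i in step_ends):
            raise ValueError("step_ends contains an out-of-range position")
        return sorted(set(step_ends))
    raise ValueError(f"unknown mode: {mode}")


print(alignment_positions(10, "single"))                          # [9]
print(alignment_positions(10, "last_k", k=3))                     # [7, 8, 9]
print(alignment_positions(10, "trajectory", step_ends=[3, 6, 9]))  # [3, 6, 9]
```

An ablation along these lines would hold the distance function fixed and vary only the returned positions, so any accuracy difference can be attributed to the choice of target.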

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of demonstrating parity between implicit and explicit CoT at this scale. We address each major comment below with clarifications and commitments to revisions that strengthen the empirical support without altering the core claims.

Point-by-point responses
  1. Referee: Method section: The distillation relies on aligning the hidden state of only one designated token between the explicit teacher and implicit student. Because explicit CoT performs autoregressive multi-step reasoning where each token's representation conditions on prior steps, a single-vector target risks encoding only a summary embedding rather than the full reasoning trajectory; performance parity could then arise from direct question-to-answer mapping instead of genuine latent reasoning. Ablations comparing single-token vs. multi-token or trajectory alignment are needed to substantiate the claim.

    Authors: We agree that the choice of alignment target merits further scrutiny. In CODI the designated token is the final token of the input sequence; because the teacher processes the full explicit CoT autoregressively, this token's hidden state is conditioned on every preceding reasoning step. Nevertheless, to directly address the concern that performance might reflect a shortcut rather than latent reasoning, we will add a new ablation subsection in the revised manuscript. It will compare (i) single final-token alignment, (ii) alignment over the last k tokens, and (iii) a trajectory-level loss that matches hidden states at every reasoning step. We will report accuracy, compression ratio, and training dynamics for each variant on GSM8K, thereby providing evidence that the single-token design captures the essential trajectory while remaining efficient.
    Revision: yes

  2. Referee: Experiments section: The reported accuracy gains and SOTA outperformance lack error bars, variance across runs, or detailed baseline comparisons and ablations on the designated token choice. These omissions make it difficult to evaluate whether the 28.2% improvement and explicit-CoT parity are robust or sensitive to implementation details.

    Authors: We acknowledge the absence of statistical reporting and token-choice ablations in the current draft. In the revision we will rerun all main experiments and baselines with five random seeds, reporting mean accuracy together with standard deviation. We will also expand the experimental section with (a) a table of additional baselines that includes prior implicit CoT methods with their original hyper-parameters re-implemented under our training regime, and (b) an ablation varying the position and number of designated tokens. These additions will allow readers to assess both robustness and sensitivity to the alignment choice.
    Revision: yes
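
The seed-level reporting promised in the second response could look like the sketch below. It uses the sample standard deviation from Python's statistics module; the function names and the accuracy values are hypothetical placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_result(accuracies: Sequence[float]) -> str:
    """Render per-seed accuracies as 'mean ± std (n=runs)'."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.1f} ± {std:.1f} (n={len(accuracies)})"


# Placeholder accuracies for three seeds (illustrative only).
print(format_result([40.0, 42.0, 44.0]))  # 42.0 ± 2.0 (n=3)
```

Reporting the run count next to the deviation matters here because a standard deviation over five seeds is itself a noisy estimate.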

Circularity Check

0 steps flagged

No significant circularity in empirical training framework

Full rationale

CODI is an empirical self-distillation training procedure that jointly optimizes an explicit-CoT teacher and an implicit-CoT student by aligning the hidden states of one designated token. Performance numbers on GSM8K and other benchmarks come from standard supervised fine-tuning and evaluation runs, not from a closed-form derivation or an equation that reduces the reported accuracy to a fitted parameter by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the method description or results. The central claim therefore remains externally falsifiable through replication on held-out data and does not collapse into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the work is an empirical training procedure rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5550 in / 1071 out tokens · 27007 ms · 2026-05-17T23:11:08.279207+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  2. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  3. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

    eess.AS 2026-03 unverdicted novelty 7.0

    FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

  4. SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving

    cs.CL 2025-11 unverdicted novelty 7.0

    SpiralThinker stabilizes iterative latent reasoning in LLMs via text-latent interleaving and progressive alignment, achieving SOTA results among latent baselines on math, logic, and commonsense tasks.

  5. Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

    cs.AI 2025-10 unverdicted novelty 7.0

    CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.

  6. STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

    cs.CL 2026-05 unverdicted novelty 6.0

    STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

  7. When Less is Enough: Efficient Inference via Collaborative Reasoning

    cs.LG 2026-05 conditional novelty 6.0

    A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.

  8. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  9. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.

  10. MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration

    cs.AI 2026-04 unverdicted novelty 6.0

    MemoSight unifies context compression and multi-token prediction via special tokens and tailored position layouts to reduce KV cache by up to 66% and accelerate inference by 1.56x while outperforming prior CoT compres...

  11. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  12. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  13. Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

    cs.LG 2026-04 conditional novelty 6.0

    LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.

  14. Mull-Tokens: Modality-Agnostic Latent Thinking

    cs.CV 2025-12 unverdicted novelty 6.0

    Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.

  15. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.

  16. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  17. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  18. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  19. ConFu: Contemplate the Future for Better Speculative Sampling

    cs.CL 2026-03 unverdicted novelty 5.0

    ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.

  20. Deep Thinking by Markov Chain of Continuous Thoughts

    cs.LG 2025-09 unverdicted novelty 5.0

    MarCos modifies transformers to perform continuous multi-step reasoning by mapping thought-level continuous states directly to next-thought distributions, achieving substantial wall-clock speedups on math problems.

  21. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  22. OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework

    cs.IR 2026-03 unverdicted novelty 4.0

    OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.
