arxiv: 2502.21074 · v3 · pith:OLBY5FDUnew · submitted 2025-02-28 · 💻 cs.CL

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He

Pith reviewed 2026-05-17 23:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords chain-of-thought · implicit CoT · self-distillation · continuous space · compression · GSM8K · language models · latent reasoning

The pith

Self-distillation aligns one token's hidden state to transfer chain-of-thought reasoning into continuous space without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to compress natural-language chain-of-thought steps into a model's continuous hidden representations. It jointly trains an explicit CoT teacher and an implicit CoT student, then distills the reasoning ability by forcing the student to match the hidden state of one designated token produced by the teacher. A reader would care because this removes the need to generate long sequences of text for each reasoning step, promising shorter outputs and possibly more robust behavior. At GPT-2 scale the resulting implicit model is reported to be the first implicit method to match the accuracy of explicit CoT on GSM8k, while compressing the reasoning by a factor of 3.1 and outperforming the previous best implicit approach by 28.2% in accuracy. The work also reports that the continuous-space version generalizes to harder datasets and offers some interpretability of the internal reasoning trace.

Core claim

CODI jointly trains a teacher on explicit natural-language chain-of-thought and a student on implicit continuous-space reasoning, then distills the teacher's reasoning capability into the student by aligning the hidden state of a single designated token. This alignment transfers the multi-step reasoning process into latent space, allowing the student to match the teacher's accuracy on GSM8k while using a 3.1 times shorter representation.
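The 3.1 figure is a ratio of lengths. A minimal sketch of that arithmetic follows, assuming (the text above does not say how the rate is measured) that it is the average number of explicit CoT tokens divided by the number of continuous reasoning steps. The function name and the sample lengths are hypothetical and chosen only to illustrate a ratio of this size.

```python
def compression_ratio(explicit_cot_tokens: float, latent_steps: float) -> float:
    """Ratio of explicit reasoning length to latent reasoning length.

    Assumption: both lengths are counted in model positions (tokens or
    continuous steps); the paper may define the rate differently.
    """
    if latent_steps <= 0:
        raise ValueError("latent_steps must be positive")
    return explicit_cot_tokens / latent_steps


# Hypothetical lengths, not taken from the paper.
ratio = compression_ratio(explicit_cot_tokens=31, latent_steps=10)
print(f"{ratio:.1f}x shorter reasoning trace")
```

Under this reading, a higher ratio means fewer positions are spent on reasoning for the same question.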

What carries the argument

Self-distillation via alignment of the hidden state of one designated token between the explicit CoT teacher and the implicit CoT student.
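One way to picture this mechanism is as a joint objective with three terms: a teacher loss, a student loss, and a penalty on the distance between the two hidden states at the designated token. The sketch below is illustrative only. The L1 distance, the unit default weight, and all function names are assumptions rather than details confirmed by the text above, and a real implementation would operate on framework tensors rather than Python lists.

```python
from typing import Sequence

Vector = Sequence[float]


def alignment_loss(teacher_hidden: Vector, student_hidden: Vector) -> float:
    """Mean absolute difference between two hidden-state vectors.

    Assumption: L1 distance; the source does not name the distance used.
    """
    if len(teacher_hidden) != len(student_hidden):
        raise ValueError("hidden states must have the same dimension")
    if len(teacher_hidden) == 0:
        raise ValueError("hidden states must be non-empty")
    total = sum(abs(t - s) for t, s in zip(teacher_hidden, student_hidden))
    return total / len(teacher_hidden)


def joint_loss(
    teacher_ce: float,
    student_ce: float,
    teacher_hidden: Vector,
    student_hidden: Vector,
    align_weight: float = 1.0,  # hypothetical weighting
) -> float:
    """Teacher loss + student loss + weighted hidden-state alignment."""
    return teacher_ce + student_ce + align_weight * alignment_loss(
        teacher_hidden, student_hidden
    )


# Toy values for illustration only.
print(joint_loss(0.5, 0.25, [1.0, 2.0], [1.0, 4.0]))
```

The alignment term is zero only when the student reproduces the teacher's hidden state at that one position, which is why the single-token choice carries so much of the argument.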

Load-bearing premise

Aligning the hidden states of a single designated token is enough to transfer the full reasoning capability from language to continuous space without loss or distortion.

What would settle it

Train the CODI student on GSM8k with the stated alignment loss and check whether its final accuracy remains within a few points of the explicit teacher's accuracy.
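The test described here reduces to a simple comparison once both models have been evaluated. A minimal sketch, assuming "a few points" means a gap of at most 3 accuracy points (a hypothetical threshold; the text gives no number) and using made-up accuracies:

```python
def within_parity(
    student_acc: float, teacher_acc: float, tolerance_points: float = 3.0
) -> bool:
    """True if the student trails the teacher by at most tolerance_points.

    Accuracies are percentages on the same test set. The default tolerance
    is an assumption standing in for "a few points".
    """
    return (teacher_acc - student_acc) <= tolerance_points


# Hypothetical accuracies, not results from the paper.
print(within_parity(student_acc=43.0, teacher_acc=44.0))
print(within_parity(student_acc=30.0, teacher_acc=44.0))
```

A student that exceeds the teacher also passes, since only a shortfall would count against the claim.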

Original abstract

Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizable to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CODI reaches explicit CoT accuracy on GSM8K with GPT-2 by aligning one token's hidden state in a joint explicit-implicit training setup, which is new for implicit methods but rests on a thin transfer assumption.

The main thing to know is that this paper gets an implicit CoT model to match explicit CoT performance on GSM8K at the GPT-2 scale. Prior implicit approaches fell short; CODI closes the gap with a self-distillation setup that trains both tasks together and aligns the hidden state of a single designated token between the explicit teacher and the implicit student. The authors report a 3.1x compression rate and a 28.2% accuracy improvement over the previous implicit state of the art, along with claims of robustness and generalization to harder datasets.

Referee Report

2 major / 2 minor

Summary. The paper introduces CODI, a self-distillation framework that jointly trains an explicit CoT teacher and an implicit CoT student model. Reasoning is compressed into continuous space by aligning the hidden states of a single designated token between teacher and student, bypassing explicit language steps. On GSM8K at GPT-2 scale, the method claims to match explicit CoT accuracy for the first time among implicit approaches, with a 3.1x compression rate and 28.2% accuracy gain over prior implicit CoT SOTA; additional claims include robustness and interpretability.

Significance. If the central empirical result holds, this would be a notable contribution to implicit reasoning in LLMs by demonstrating that latent continuous space can achieve parity with explicit language-based CoT without sacrificing accuracy. The self-distillation approach, reported compression factor, and open-sourced code (https://github.com/zhenyi4/codi) are strengths that support reproducibility and further exploration of continuous-space reasoning.

major comments (2)

Method section: The distillation relies on aligning the hidden state of only one designated token between the explicit teacher and implicit student. Because explicit CoT performs autoregressive multi-step reasoning where each token's representation conditions on prior steps, a single-vector target risks encoding only a summary embedding rather than the full reasoning trajectory; performance parity could then arise from direct question-to-answer mapping instead of genuine latent reasoning. Ablations comparing single-token vs. multi-token or trajectory alignment are needed to substantiate the claim.
Experiments section: The reported accuracy gains and SOTA outperformance lack error bars, variance across runs, or detailed baseline comparisons and ablations on the designated token choice. These omissions make it difficult to evaluate whether the 28.2% improvement and explicit-CoT parity are robust or sensitive to implementation details.
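The ablation requested in the first major comment amounts to changing which teacher positions supply alignment targets. A hedged sketch of the three options follows, assuming for illustration that the designated token sits at the last position of the teacher sequence and that the end position of each reasoning step is known. The function name, the mode names, and the `step_ends` argument are all hypothetical.

```python
from typing import List, Optional


def alignment_positions(
    seq_len: int,
    mode: str,
    k: int = 1,
    step_ends: Optional[List[int]] = None,
) -> List[int]:
    """Return the teacher positions whose hidden states would be matched.

    mode="single":     only the final position (assumed designated token)
    mode="last_k":     the final k positions
    mode="trajectory": the end position of every reasoning step
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if mode == "single":
        return [seq_len - 1]
    if mode == "last_k":
        if not 1 <= k <= seq_len:
            raise ValueError("k must be between 1 and seq_len")
        return list(range(seq_len - k, seq_len))
    if mode == "trajectory":
        if not step_ends:
            raise ValueError("trajectory mode needs step_ends")
        if any(p < 0 or p >= seq_len for p in step_ends):
            raise ValueError("step_ends must lie inside the sequence")
        return sorted(step_ends)
    raise ValueError(f"unknown mode: {mode}")


# Toy sequence of 10 positions with three reasoning steps.
for mode, kwargs in [
    ("single", {}),
    ("last_k", {"k": 3}),
    ("trajectory", {"step_ends": [2, 6, 9]}),
]:
    print(mode, alignment_positions(10, mode, **kwargs))
```

Comparing accuracy across these three target sets is what would show whether the single-token target is a summary shortcut or carries the full trajectory.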

minor comments (2)

Abstract and results: Claims of 'robustness' and 'interpretability' are stated without accompanying quantitative metrics or analysis in the provided summary; these should be supported by specific numbers or figures.
Notation: The description of the 'designated token' is somewhat vague; clarifying its identity (e.g., final token, special token) and how its hidden state is extracted would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of demonstrating parity between implicit and explicit CoT at this scale. We address each major comment below with clarifications and commitments to revisions that strengthen the empirical support without altering the core claims.

Referee: Method section: The distillation relies on aligning the hidden state of only one designated token between the explicit teacher and implicit student. Because explicit CoT performs autoregressive multi-step reasoning where each token's representation conditions on prior steps, a single-vector target risks encoding only a summary embedding rather than the full reasoning trajectory; performance parity could then arise from direct question-to-answer mapping instead of genuine latent reasoning. Ablations comparing single-token vs. multi-token or trajectory alignment are needed to substantiate the claim.

Authors: We agree that the choice of alignment target merits further scrutiny. In CODI the designated token is the final token of the input sequence; because the teacher processes the full explicit CoT autoregressively, this token's hidden state is conditioned on every preceding reasoning step. Nevertheless, to directly address the concern that performance might reflect a shortcut rather than latent reasoning, we will add a new ablation subsection in the revised manuscript. It will compare (i) single final-token alignment, (ii) alignment over the last k tokens, and (iii) a trajectory-level loss that matches hidden states at every reasoning step. We will report accuracy, compression ratio, and training dynamics for each variant on GSM8K, thereby providing evidence that the single-token design captures the essential trajectory while remaining efficient.

revision: yes

Referee: Experiments section: The reported accuracy gains and SOTA outperformance lack error bars, variance across runs, or detailed baseline comparisons and ablations on the designated token choice. These omissions make it difficult to evaluate whether the 28.2% improvement and explicit-CoT parity are robust or sensitive to implementation details.

Authors: We acknowledge the absence of statistical reporting and token-choice ablations in the current draft. In the revision we will rerun all main experiments and baselines with five random seeds, reporting mean accuracy together with standard deviation. We will also expand the experimental section with (a) a table of additional baselines that includes prior implicit CoT methods with their original hyper-parameters re-implemented under our training regime, and (b) an ablation varying the position and number of designated tokens. These additions will allow readers to assess both robustness and sensitivity to the alignment choice.

revision: yes
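The statistical reporting promised in the second response is straightforward to compute once per-seed accuracies exist. A minimal sketch using Python's standard library, assuming the sample standard deviation (n - 1 in the denominator) is the intended statistic; the five accuracies below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return mean(accuracies), stdev(accuracies)


# Hypothetical accuracies from five random seeds.
runs = [43.0, 44.0, 45.0, 44.0, 44.0]
avg, sd = summarize_runs(runs)
print(f"accuracy: {avg:.1f} ± {sd:.1f} over {len(runs)} seeds")
```

Reporting the same summary for the explicit teacher and each baseline would let a reader judge whether the claimed parity falls within run-to-run variation.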

Circularity Check

0 steps flagged

No significant circularity in empirical training framework

CODI is an empirical self-distillation training procedure that jointly optimizes an explicit-CoT teacher and an implicit-CoT student by aligning hidden states of one designated token; performance numbers on GSM8k and other benchmarks are obtained from standard supervised fine-tuning and evaluation runs rather than from any closed-form derivation or equation that reduces the reported accuracy to a fitted parameter by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the method description or results. The central claim therefore remains externally falsifiable through replication on held-out data and does not collapse into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the work is an empirical training procedure rather than a theoretical derivation.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
PLUME: Latent Reasoning Based Universal Multimodal Embedding
cs.CV 2026-04 unverdicted novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 7.0

FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving
cs.CL 2025-11 unverdicted novelty 7.0

SpiralThinker stabilizes iterative latent reasoning in LLMs via text-latent interleaving and progressive alignment, achieving SOTA results among latent baselines on math, logic, and commonsense tasks.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
cs.CL 2026-05 unverdicted novelty 6.0

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
When Less is Enough: Efficient Inference via Collaborative Reasoning
cs.LG 2026-05 conditional novelty 6.0

A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
LEPO: Latent Reasoning Policy Optimization for Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration
cs.AI 2026-04 unverdicted novelty 6.0

MemoSight unifies context compression and multi-token prediction via special tokens and tailored position layouts to reduce KV cache by up to 66% and accelerate inference by 1.56x while outperforming prior CoT compres...
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
cs.LG 2026-04 conditional novelty 6.0

LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
Mull-Tokens: Modality-Agnostic Latent Thinking
cs.CV 2025-12 unverdicted novelty 6.0

Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
LEPO: Latent Reasoning Policy Optimization for Large Language Models
cs.LG 2026-04 unverdicted novelty 5.0

LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
cs.CV 2026-04 unverdicted novelty 5.0

MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
ConFu: Contemplate the Future for Better Speculative Sampling
cs.CL 2026-03 unverdicted novelty 5.0

ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
Deep Thinking by Markov Chain of Continuous Thoughts
cs.LG 2025-09 unverdicted novelty 5.0

MarCos modifies transformers to perform continuous multi-step reasoning by mapping thought-level continuous states directly to next-thought distributions, achieving substantial wall-clock speedups on math problems.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
cs.IR 2026-03 unverdicted novelty 4.0

OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.
