CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Pith reviewed 2026-05-17 23:11 UTC · model grok-4.3
The pith
Self-distillation aligns one token's hidden state to transfer chain-of-thought reasoning into continuous space, matching explicit CoT accuracy on GSM8k at GPT-2 scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CODI jointly trains a teacher task on explicit natural-language chain-of-thought and a student task on implicit continuous-space reasoning, distilling the teacher's reasoning ability into the student by aligning the hidden state of a single designated token. The paper reports that this alignment transfers the multi-step reasoning process into latent space, so that the student matches explicit CoT accuracy on GSM8k at GPT-2 scale with a 3.1x compression rate.
What carries the argument
Self-distillation via alignment of the hidden state of one designated token between the explicit CoT teacher and the implicit CoT student.
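A minimal sketch of what single-token alignment could look like, assuming an L1 distance between the teacher's and student's hidden-state vectors, averaged over layers, with the teacher vectors treated as fixed targets. The distance function, the per-layer averaging, and the names `l1_distance`, `alignment_loss`, `teacher_hidden`, and `student_hidden` are illustrative assumptions; this page does not specify them, and the released code may differ.

```python
from typing import Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("hidden states must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def alignment_loss(teacher_hidden: Sequence[Vector],
                   student_hidden: Sequence[Vector]) -> float:
    """Average L1 gap over layers for the ONE designated token.

    Each argument holds one vector per layer, all taken at the designated
    token position. The teacher vectors are treated as constants here; in
    a real training loop no gradient would flow into them (assumption).
    """
    if len(teacher_hidden) != len(student_hidden) or not teacher_hidden:
        raise ValueError("teacher and student must expose the same layers")
    gaps = [l1_distance(t, s) for t, s in zip(teacher_hidden, student_hidden)]
    return sum(gaps) / len(gaps)


# Toy example with made-up numbers: 2 layers, hidden size 3.
teacher = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
student = [[1.0, 1.0, 2.0], [0.5, 0.5, 0.5]]
loss = alignment_loss(teacher, student)
print(f"alignment loss: {loss:.4f}")
```

In a joint-training setup this term would be added to the teacher's and student's own language-modeling losses; how those terms are weighted is not stated on this page.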
Load-bearing premise
Aligning the hidden states of a single designated token is enough to transfer the full reasoning capability from language to continuous space without loss or distortion.
What would settle it
Train the CODI student on GSM8k with the stated alignment loss and check whether its final accuracy remains within a few points of the explicit teacher's accuracy.
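The check described above can be sketched as a small comparison, assuming exact-match accuracy on final answers and a tolerance of three percentage points. Both the tolerance and the function names `accuracy` and `within_tolerance` are hypothetical choices for illustration; "a few points" is not quantified in the source, and the answers below are toy values, not GSM8k data.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy in percent."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty answer lists")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def within_tolerance(student_acc: float,
                     teacher_acc: float,
                     tolerance_points: float = 3.0) -> bool:
    """True if the student is at most `tolerance_points` below the teacher."""
    return (teacher_acc - student_acc) <= tolerance_points


# Toy answers (made up for illustration only).
references = ["18", "3", "70", "5"]
teacher_preds = ["18", "3", "70", "5"]
student_preds = ["18", "3", "70", "4"]

teacher_acc = accuracy(teacher_preds, references)
student_acc = accuracy(student_preds, references)
print(teacher_acc, student_acc, within_tolerance(student_acc, teacher_acc))
```

The one-sided comparison is a design choice: a student that beats the teacher still passes, while a student that falls more than the tolerance below it does not.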
Original abstract
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizable to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CODI, a self-distillation framework that jointly trains an explicit CoT teacher and an implicit CoT student model. Reasoning is compressed into continuous space by aligning the hidden states of a single designated token between teacher and student, bypassing explicit language steps. On GSM8K at GPT-2 scale, the method claims to match explicit CoT accuracy for the first time among implicit approaches, with a 3.1x compression rate and 28.2% accuracy gain over prior implicit CoT SOTA; additional claims include robustness and interpretability.
Significance. If the central empirical result holds, this would be a notable contribution to implicit reasoning in LLMs by demonstrating that latent continuous space can achieve parity with explicit language-based CoT without sacrificing accuracy. The self-distillation approach, reported compression factor, and open-sourced code (https://github.com/zhenyi4/codi) are strengths that support reproducibility and further exploration of continuous-space reasoning.
Major comments (2)
- Method section: The distillation relies on aligning the hidden state of only one designated token between the explicit teacher and implicit student. Because explicit CoT performs autoregressive multi-step reasoning where each token's representation conditions on prior steps, a single-vector target risks encoding only a summary embedding rather than the full reasoning trajectory; performance parity could then arise from direct question-to-answer mapping instead of genuine latent reasoning. Ablations comparing single-token vs. multi-token or trajectory alignment are needed to substantiate the claim.
- Experiments section: The reported accuracy gains and SOTA outperformance lack error bars, variance across runs, or detailed baseline comparisons and ablations on the designated token choice. These omissions make it difficult to evaluate whether the 28.2% improvement and explicit-CoT parity are robust or sensitive to implementation details.
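The run-to-run variance that the second major comment asks for could be reported along these lines. This is a sketch using only the Python standard library; the accuracy values are arbitrary placeholders, not results from the paper, and `summarize_runs` is a hypothetical helper name.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# PLACEHOLDER values, one per random seed; NOT numbers from the paper.
placeholder_runs = [10.0, 12.0, 11.0, 13.0, 9.0]
mean_acc, std_acc = summarize_runs(placeholder_runs)
print(f"accuracy: {mean_acc:.1f} ± {std_acc:.1f} (n={len(placeholder_runs)})")
```

`statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the usual choice when a handful of seeds stands in for the full distribution of training runs.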
Minor comments (2)
- Abstract and results: Claims of 'robustness' and 'interpretability' are stated without accompanying quantitative metrics or analysis in the provided summary; these should be supported by specific numbers or figures.
- Notation: The description of the 'designated token' is somewhat vague; clarifying its identity (e.g., final token, special token) and how its hidden state is extracted would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of demonstrating parity between implicit and explicit CoT at this scale. We address each major comment below with clarifications and commitments to revisions that strengthen the empirical support without altering the core claims.
- Referee: Method section: The distillation relies on aligning the hidden state of only one designated token between the explicit teacher and implicit student. Because explicit CoT performs autoregressive multi-step reasoning where each token's representation conditions on prior steps, a single-vector target risks encoding only a summary embedding rather than the full reasoning trajectory; performance parity could then arise from direct question-to-answer mapping instead of genuine latent reasoning. Ablations comparing single-token vs. multi-token or trajectory alignment are needed to substantiate the claim.
  Authors: We agree that the choice of alignment target merits further scrutiny. In CODI the designated token is the final token of the input sequence; because the teacher processes the full explicit CoT autoregressively, this token's hidden state is conditioned on every preceding reasoning step. Nevertheless, to directly address the concern that performance might reflect a shortcut rather than latent reasoning, we will add a new ablation subsection in the revised manuscript. It will compare (i) single final-token alignment, (ii) alignment over the last k tokens, and (iii) a trajectory-level loss that matches hidden states at every reasoning step. We will report accuracy, compression ratio, and training dynamics for each variant on GSM8K, thereby providing evidence that the single-token design captures the essential trajectory while remaining efficient.
  Revision: yes
- Referee: Experiments section: The reported accuracy gains and SOTA outperformance lack error bars, variance across runs, or detailed baseline comparisons and ablations on the designated token choice. These omissions make it difficult to evaluate whether the 28.2% improvement and explicit-CoT parity are robust or sensitive to implementation details.
  Authors: We acknowledge the absence of statistical reporting and token-choice ablations in the current draft. In the revision we will rerun all main experiments and baselines with five random seeds, reporting mean accuracy together with standard deviation. We will also expand the experimental section with (a) a table of additional baselines that includes prior implicit CoT methods with their original hyper-parameters re-implemented under our training regime, and (b) an ablation varying the position and number of designated tokens. These additions will allow readers to assess both robustness and sensitivity to the alignment choice.
  Revision: yes
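The three alignment variants named in the first response differ mainly in which token positions contribute hidden states to the loss. The sketch below makes that difference concrete. The function `alignment_positions`, the variant labels, and the `step_ends` argument (the position where each reasoning step ends) are hypothetical; the rebuttal does not say how reasoning steps would be delimited.

```python
from typing import List, Optional


def alignment_positions(seq_len: int,
                        variant: str,
                        k: int = 1,
                        step_ends: Optional[List[int]] = None) -> List[int]:
    """Return the token positions whose hidden states would be aligned."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if variant == "single":
        # (i) only the final token of the sequence
        return [seq_len - 1]
    if variant == "last_k":
        # (ii) the last k tokens
        if not 1 <= k <= seq_len:
            raise ValueError("k must be between 1 and seq_len")
        return list(range(seq_len - k, seq_len))
    if variant == "trajectory":
        # (iii) one position per reasoning step (delimiting is an assumption)
        if not step_ends:
            raise ValueError("trajectory alignment needs step_ends")
        if any(p < 0 or p >= seq_len for p in step_ends):
            raise ValueError("step_ends must lie inside the sequence")
        return sorted(set(step_ends))
    raise ValueError(f"unknown variant: {variant}")


for name, kwargs in [("single", {}),
                     ("last_k", {"k": 3}),
                     ("trajectory", {"step_ends": [3, 6, 9]})]:
    print(name, alignment_positions(10, name, **kwargs))
```

Keeping the position selection separate from the distance computation would let the same loss function serve all three ablation arms, so that any accuracy difference can be attributed to the choice of alignment target.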
Circularity Check
No significant circularity in empirical training framework
CODI is an empirical self-distillation training procedure that jointly optimizes an explicit-CoT teacher and an implicit-CoT student by aligning hidden states of one designated token; performance numbers on GSM8k and other benchmarks are obtained from standard supervised fine-tuning and evaluation runs rather than from any closed-form derivation or equation that reduces the reported accuracy to a fitted parameter by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the method description or results. The central claim therefore remains externally falsifiable through replication on held-out data and does not collapse into its own inputs.
Forward citations
Cited by 22 Pith papers
- Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
  Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
- PLUME: Latent Reasoning Based Universal Multimodal Embedding
  PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
- The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
  FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
- SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving
  SpiralThinker stabilizes iterative latent reasoning in LLMs via text-latent interleaving and progressive alignment, achieving SOTA results among latent baselines on math, logic, and commonsense tasks.
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
  CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
- STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
  STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
- When Less is Enough: Efficient Inference via Collaborative Reasoning
  A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- LEPO: Latent Reasoning Policy Optimization for Large Language Models
  LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
- MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration
  MemoSight unifies context compression and multi-token prediction via special tokens and tailored position layouts to reduce KV cache by up to 66% and accelerate inference by 1.56x while outperforming prior CoT compres...
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
- The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
  The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
- Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
  LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
- Mull-Tokens: Modality-Agnostic Latent Thinking
  Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
- LEPO: Latent Reasoning Policy Optimization for Large Language Models
  LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
  MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
- ConFu: Contemplate the Future for Better Speculative Sampling
  ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
- Deep Thinking by Markov Chain of Continuous Thoughts
  MarCos modifies transformers to perform continuous multi-step reasoning by mapping thought-level continuous states directly to next-thought distributions, achieving substantial wall-clock speedups on math problems.
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
  A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
- OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
  OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.