arxiv: 2604.22709 · v2 · submitted 2026-04-24 · 💻 cs.CL

Recognition: unknown

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

Pith reviewed 2026-05-08 11:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords: abstract chain-of-thought, latent reasoning, chain-of-thought compression, efficient inference, discrete latent tokens, self-distillation, constrained decoding, reasoning token efficiency

The pith

Language models can reason by emitting short sequences of abstract tokens instead of natural-language chains of thought.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a language model can be post-trained to output a brief sequence drawn from a reserved abstract vocabulary before producing its final answer, replacing the usual lengthy verbal reasoning trace. This change is achieved through an initial warm-up phase that alternates between compressing verbal chains via masking and self-distillation under constrained decoding, followed by reinforcement learning that further optimizes the abstract sequences. If successful, the approach yields up to 11.6 times fewer reasoning tokens while preserving performance on mathematical problems, instruction following, and multi-hop reasoning tasks, and it works across different model families. The abstract tokens also develop a power-law frequency distribution similar to words in natural language.

Core claim

Abstract Chain-of-Thought lets a language model generate a short sequence of tokens from a reserved vocabulary in place of a natural-language chain-of-thought before producing the response. Training begins with a policy-iteration warm-up that alternates masking-based bottlenecking from verbal CoT plus supervised fine-tuning, then self-distillation to generate abstract tokens from the prompt alone via constrained decoding; after warm-up, warm-started reinforcement learning under constrained decoding optimizes the abstract sequences.
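Constrained decoding, as used in the claim above, can be illustrated by masking every logit outside the reserved vocabulary before choosing the next token. The following is a minimal sketch, not the paper's implementation: the toy vocabulary, the ids chosen for the abstract tokens and the end delimiter, and the `toy_logits` stand-in for a model are all assumptions made for illustration.

```python
import math

def constrain_logits(logits, allowed_ids):
    """Return a copy of logits with every id outside allowed_ids set to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_abstract_decode(step_logits_fn, abstract_ids, end_id, max_len):
    """Greedy decoding restricted to the abstract codebook plus an end delimiter."""
    allowed = set(abstract_ids) | {end_id}
    out = []
    for _ in range(max_len):
        masked = constrain_logits(step_logits_fn(out), allowed)
        nxt = max(range(len(masked)), key=masked.__getitem__)
        if nxt == end_id:
            break
        out.append(nxt)
    return out

# Hypothetical toy vocabulary of 10 ids: 6..8 stand in for abstract tokens,
# 9 stands in for the closing delimiter.
ABSTRACT_IDS = [6, 7, 8]
END_ID = 9

def toy_logits(prefix):
    # Hypothetical model: prefers ordinary token 0 when unconstrained,
    # and favours the end delimiter once three tokens have been emitted.
    logits = [5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 2.5, 0.0]
    if len(prefix) >= 3:
        logits[END_ID] = 10.0
    return logits

sequence = greedy_abstract_decode(toy_logits, ABSTRACT_IDS, END_ID, max_len=8)
```

Although the unconstrained argmax would be token 0, the mask ensures that only ids from the reserved set can ever be emitted between the delimiters.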

What carries the argument

Abstract Chain-of-Thought, a discrete latent reasoning mechanism in which the model produces tokens from a reserved vocabulary instead of verbal reasoning steps.

If this is right

Reasoning token usage drops by up to 11.6 times while accuracy remains comparable on mathematical, instruction-following, and multi-hop tasks.
The method transfers across language model families without requiring architecture changes.
An emergent power-law frequency distribution develops over the abstract vocabulary and evolves across training phases.
Post-training alone can install latent reasoning that reduces inference cost without altering the base model.
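One way to check a power-law claim like the one above is to fit a line to log-frequency against log-rank and read off the slope. The sketch below assumes a simple least-squares fit, which is a common rough diagnostic rather than the paper's stated procedure; the token names and the synthetic stream are invented for illustration.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Least-squares slope of log(frequency) vs. log(rank), returned negated."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic stream whose frequencies are exactly proportional to 1/rank
# (840 is divisible by every rank from 1 to 8).
synthetic = []
for rank in range(1, 9):
    synthetic.extend([f"<abs_{rank}>"] * (840 // rank))

exponent = zipf_exponent(synthetic)  # close to 1.0 for this synthetic data
```

An exponent near 1 corresponds to the classic Zipf pattern; tracking it per training phase would show how the distribution evolves.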

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the abstract tokens encode reusable reasoning structure, they could enable faster transfer to new domains than verbal chains do.
The learned abstract sequences might allow direct inspection of internal reasoning steps without needing natural-language translation.
Dynamic switching between verbal and abstract modes could further optimize cost on tasks of varying difficulty.
The power-law pattern suggests the abstract vocabulary may scale in complexity similarly to natural language when models grow larger.

Load-bearing premise

The two-stage warm-up plus reinforcement learning under constrained decoding can teach the model to use the new abstract tokens for genuine reasoning rather than superficial pattern matching.

What would settle it

Performance on held-out reasoning tasks falls to the level of a model that outputs random tokens from the same reserved vocabulary.
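The control described above can be set up by sampling uniformly from the reserved vocabulary and comparing downstream accuracy against the learned sequences. This is a bookkeeping sketch only: the vocabulary size of 64 matches one of the ablation settings in the figures, while the sequence length and the correctness flags are made up for illustration.

```python
import random

def random_abstract_sequence(vocab_size, length, rng):
    """Uniformly sample `length` ids from a reserved vocabulary of `vocab_size`."""
    return [rng.randrange(vocab_size) for _ in range(length)]

def accuracy(flags):
    """Fraction of examples answered correctly."""
    return sum(flags) / len(flags)

def reasoning_gap(learned_correct, control_correct):
    """Accuracy with learned abstract sequences minus accuracy with random ones."""
    return accuracy(learned_correct) - accuracy(control_correct)

rng = random.Random(0)
control = random_abstract_sequence(vocab_size=64, length=12, rng=rng)

# Made-up per-example correctness flags, purely to show the comparison.
learned_flags = [True, True, False, True]
control_flags = [True, False, False, True]
gap = reasoning_gap(learned_flags, control_flags)
```

A gap near zero on held-out tasks would suggest the abstract tokens carry little task-relevant information; a large positive gap would count against that reading.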

Figures

Figures reproduced from arXiv: 2604.22709 by Keshav Ramji, Ramón Fernandez Astudillo, Tahira Naseem.

**Figure 1.** Verbalized vs. Abstract Chain-of-Thought. Verbalized CoT (left) generates an explicit natural language rationale (Step 1 through Step 8) inside <think> · · · </think> tags before producing the answer. Abstract CoT (right) instead emits a short sequence of tokens from the reserved abstract vocabulary inside <beginabstract> · · · <endabstract> delimiters, achieving the same answer with substantially fewer re…

**Figure 2.** Abstract Chain-of-Thought: The training recipe consists of two stages: (i) a warm-up loop, consisting of a Bottlenecked SFT phase with guidance from a teacher Verbal CoT, and a Self-Distillation phase with on-policy abstract sequence generation, repeated iteratively, and (ii) reinforcement learning using GRPO with constrained decoding for the rollouts, which rewards abstract sequences that lead to a high…

**Figure 3.** Plot of average generated (reasoning + response) tokens (log-scale) vs. benchmark…

**Figure 4.** (Left) Evolution of the abstract token distribution over 1M episodes of warm…

**Figure 5.** MATH-500 abstract vocabulary scaling ablation across stages of the Abstract-CoT…

**Figure 6.** AlpacaEval abstract vocabulary scaling ablation. The results show clearer differen…

**Figure 7.** HotpotQA abstract vocabulary scaling ablation. The results again show clearly…

**Figure 8.** Scaling ablation with M = 2 abstract token vocabulary

**Figure 9.** Scaling ablation with M = 4 abstract token vocabulary

**Figure 10.** Scaling ablation with M = 8 abstract token vocabulary

**Figure 11.** Scaling ablation with M = 16 abstract token vocabulary

**Figure 12.** Scaling ablation with M = 32 abstract token vocabulary

**Figure 13.** Scaling ablation with M = 64 abstract token vocabulary. Note that this is the same as…

**Figure 14.** Scaling ablation with M = 128 abstract token vocabulary.

**Figure 15.** Scaling ablation with M = 256 abstract token vocabulary

**Figure 16.** Scaling ablation with M = 512 abstract token vocabulary.

**Figure 17.** Cold-start RL in the scaling ablation with…

Original abstract

While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6× fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
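The abstract does not spell out how the masking-based bottleneck is built. One plausible reading, sketched below purely as an assumption, is an attention mask over the layout [prompt | verbal CoT | abstract tokens | response] in which abstract positions may read the verbal CoT while response positions may not, so that anything the response needs from the CoT has to pass through the abstract tokens. The function name, layout, and segment lengths are hypothetical.

```python
def bottleneck_mask(n_prompt, n_cot, n_abstract, n_response):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout: [prompt | verbal CoT | abstract tokens | response].
    The mask is causal, and response positions are additionally blocked
    from the verbal CoT span.
    """
    total = n_prompt + n_cot + n_abstract + n_response
    cot_start, cot_end = n_prompt, n_prompt + n_cot
    resp_start = cot_end + n_abstract
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i
            hidden = i >= resp_start and cot_start <= j < cot_end
            row.append(causal and not hidden)
        mask.append(row)
    return mask

# Positions 0-1: prompt, 2-4: verbal CoT, 5-6: abstract tokens, 7: response.
mask = bottleneck_mask(n_prompt=2, n_cot=3, n_abstract=2, n_response=1)
```

In this toy layout the first abstract position (row 5) can see the prompt and the CoT, while the response position (row 7) sees only the prompt, the abstract tokens, and itself.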

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete training recipe for discrete latent reasoning via reserved abstract tokens that cuts reasoning length by more than 10×, but the evidence that those tokens perform actual intermediate reasoning rather than compressed prediction is still thin.

The main takeaway is a post-training method that replaces long verbal chains of thought with short sequences drawn from a reserved abstract vocabulary. The authors use a warm-up that first bottlenecks verbal CoT through masking for supervised fine-tuning, then self-distills the abstract tokens from the prompt alone under constrained decoding, and finally runs RL to optimize the abstract sequences. They report up to 11.6 times fewer reasoning tokens with performance that stays comparable on math, instruction-following, and multi-hop tasks, and the approach transfers across model families. An emergent power-law distribution over the abstract vocabulary is also noted as training progresses.
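The Figure 2 caption names GRPO as the RL algorithm. The core of GRPO is a group-relative advantage: several rollouts are sampled for the same prompt and each reward is standardized against the group. The sketch below shows only that step, assuming binary correctness rewards and a population standard deviation; the reward values and the `eps` constant are illustrative, and the paper's actual reward design is not described on this page.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four rollouts for one prompt: 1.0 means the response
# that followed the abstract sequence was judged correct, 0.0 means it was not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Abstract sequences that led to correct responses receive positive advantages and the others negative ones, so no separate value network is needed.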

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Abstract Chain-of-Thought (Abstract-CoT), a post-training method in which language models generate short sequences of discrete tokens drawn from a reserved abstract vocabulary in place of explicit natural-language chains-of-thought. The approach uses a two-stage warm-up (masking-based bottlenecking followed by self-distillation under constrained decoding) and subsequent reinforcement learning to optimize abstract-sequence generation. It reports up to 11.6× reduction in reasoning tokens while maintaining comparable performance on mathematical reasoning, instruction-following, and multi-hop reasoning tasks, with generalization across model families and an emergent power-law distribution over the abstract vocabulary.

Significance. If the abstract tokens can be shown to carry genuine intermediate reasoning content rather than serving as compressed predictors, the method would offer a practical route to substantially lower inference cost for reasoning workloads while preserving performance. The cross-family generalization and the observed power-law statistics are noteworthy and, if robust, would strengthen the case for learned discrete latent reasoning languages.

major comments (2)

[Method (warm-up loop and RL stage)] The central claim that abstract sequences implement latent reasoning (rather than direct prompt-to-answer mappings) is load-bearing yet unsupported by the described procedure. The two-stage warm-up plus RL under constrained decoding permits the model to learn abstract tokens as compressed class labels or surface cues distilled from verbal CoT; no causal intervention, information-probing, or out-of-distribution test is described that would distinguish these alternatives.
[Experiments and Results] No ablation studies, error bars, training curves, or per-task quantitative tables are referenced that would allow assessment of whether performance is truly comparable or whether RL instabilities occurred. The 11.6× token-reduction figure therefore cannot be evaluated for reliability or sensitivity to the abstract-vocabulary size hyperparameter.

minor comments (2)

[Abstract and Method] The phrase 'policy iteration-style warm-up loop' is used without specifying the exact alternation schedule, reward shaping, or constrained-decoding implementation details needed for reproducibility.
[Abstract] A concrete example showing a prompt, the generated abstract token sequence, and the corresponding verbal CoT would help readers understand the learned mapping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with point-by-point responses, indicating planned revisions where appropriate to strengthen the work.

Referee: The central claim that abstract sequences implement latent reasoning (rather than direct prompt-to-answer mappings) is load-bearing yet unsupported by the described procedure. The two-stage warm-up plus RL under constrained decoding permits the model to learn abstract tokens as compressed class labels or surface cues distilled from verbal CoT; no causal intervention, information-probing, or out-of-distribution test is described that would distinguish these alternatives.

Authors: We agree that direct evidence distinguishing latent reasoning from compressed direct mappings would strengthen the central claim. The warm-up procedure is explicitly designed to enforce an information bottleneck: verbal CoT is masked during supervised fine-tuning, and constrained decoding during self-distillation forces the model to produce and rely on abstract tokens before generating the answer. The subsequent RL stage further optimizes the abstract sequences for task performance. Indirect support comes from the emergent power-law distribution over the abstract vocabulary (evolving across phases, akin to natural language) and consistent generalization across model families, which would be unlikely for arbitrary class labels. We will add a dedicated discussion subsection addressing alternative interpretations and outlining future probing experiments (e.g., mutual information analysis between abstract tokens and reasoning steps).

revision: partial

Referee: No ablation studies, error bars, training curves, or per-task quantitative tables are referenced that would allow assessment of whether performance is truly comparable or whether RL instabilities occurred. The 11.6× token-reduction figure therefore cannot be evaluated for reliability or sensitivity to the abstract-vocabulary size hyperparameter.

Authors: We acknowledge that the current manuscript lacks these details, limiting assessment of robustness. In the revised version we will expand the Experiments and Results sections to include: ablation studies on the two warm-up stages and on abstract vocabulary size (32/64/128/256 tokens); error bars computed over three random seeds for all main results; training curves for the RL phase showing reward and token-length trajectories; and expanded per-task tables reporting accuracy, token counts, and reduction ratios. The 11.6× figure is the peak reduction observed on GSM8K with vocabulary size 128; we will add a sensitivity table across vocabulary sizes and clarify that all reported numbers use the same constrained decoding setup.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard empirical training procedures

The paper presents Abstract-CoT as a post-training procedure using masking-based bottlenecking, supervised fine-tuning, self-distillation via constrained decoding, and subsequent reinforcement learning. No equations, predictions, or first-principles claims are described that reduce the token reduction or performance results to a fitted parameter or self-referential definition by construction. The power-law observation over abstract tokens is reported as an empirical finding, not a derived necessity. The method is self-contained against external benchmarks via reported experiments on math, instruction, and multi-hop tasks, with no load-bearing self-citations or ansatzes that loop back to the target claims.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the assumption that abstract tokens can be made semantically useful through the described warm-up and that the RL stage improves rather than degrades reasoning quality.

free parameters (1)

abstract vocabulary size
Size of the reserved token set is a free hyperparameter whose value is not stated and must be chosen to enable useful compression.
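To see why the vocabulary size matters for compression, note that a sequence of L tokens drawn from M symbols can distinguish at most M^L messages, that is, L · log2(M) bits. The sketch below computes this upper bound for a few of the vocabulary sizes that appear in the scaling-ablation figure captions; the sequence length of 16 is an arbitrary illustrative choice, and the bound says nothing about how much information the trained model actually encodes.

```python
import math

def max_bits(vocab_size, length):
    """Upper bound on information carried by `length` tokens from `vocab_size` symbols."""
    return length * math.log2(vocab_size)

# Vocabulary sizes taken from the ablation captions (M = 2, 64, 512);
# the sequence length of 16 is hypothetical.
bits_by_size = {m: max_bits(m, length=16) for m in (2, 64, 512)}
```

Capacity grows only logarithmically in M, so going from 64 to 512 symbols adds 3 bits per token, whereas doubling the sequence length doubles the bound.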

invented entities (1)

abstract tokens from reserved vocabulary (no independent evidence)
purpose: Serve as discrete latent reasoning representations that replace natural-language CoT
New vocabulary is introduced without prior independent evidence that it can carry reasoning content.

pith-pipeline@v0.9.0 · 5557 in / 1277 out tokens · 52183 ms · 2026-05-08T11:46:52.783445+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dynamic Latent Routing
cs.LG 2026-05 unverdicted novelty 7.0

Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across fou...
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
cs.AI 2026-05 unverdicted novelty 3.0

Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Reference graph

Works this paper leans on

50 extracted references · 21 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1]

L1: Controlling how long a reasoning model thinks with reinforcement learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve

2025
[2]

Dolci-Think-RL-7B

AI2. Dolci-Think-RL-7B. Hugging Face Datasets, 2025a. URL https://huggingface.co/datasets/allenai/Dolci-Think-RL-7B. Accessed: 2026-02-08

2025
[3]

Dolci-Think-SFT-7B

AI2. Dolci-Think-SFT-7B. Hugging Face Datasets, 2025b. URL https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B. Accessed: 2026-02-08

2025
[4]

AIME problems and solutions

Art of Problem Solving. AIME problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2025. Accessed: 2026-04-24

2025
[5]

Soft tokens, hard truths

Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens, hard truths, 2025. URL https://arxiv.org/abs/2509.19170

work page arXiv 2025
[6]

Compressed chain of thought: Efficient reasoning through dense representations

Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations, 2024. URL https://arxiv.org/abs/2412.13171

work page arXiv 2024
[7]

From explicit cot to implicit cot: Learning to internalize cot step by step

Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step, 2024. URL https://arxiv.org/abs/2405.14838

work page arXiv 2024
[8]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review arXiv 2024
[9]

Think before you speak: Training language models with pause tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ph04CRkPdC

2024
[10]

Granite 3.0 language models, October 2024

IBM Granite Team. Granite 3.0 language models, October 2024. URL https://github.com/ibm-granite/granite-3.0-language-models/

2024
[11]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[12]

Training large language model to reason in a continuous latent space, 2025

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language model to reason in a continuous latent space, 2025. URL https://openreview.net/forum?id=tG4SgayTtk

2025
[13]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe

2021
[14]

Thinkprune: Pruning long chain-of-thought of LLMs via reinforcement learning

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of LLMs via reinforcement learning. Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/forum?id=V51gPu1uQD

2026
[15]

Distilling step-by-step: Outperforming larger language models with less training data

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step: Outperforming larger language models with less training data. In Findings of the Association for Computational Linguistics: ACL 2023, 2023

2023
[16]

Expanding computation spaces of large language models at inference time, 2025

Yoonna Jang, Kisu Yang, and Isabelle Augenstein. Expanding computation spaces of large language models at inference time, 2025. URL https://arxiv.org/abs/2509.24884

work page arXiv 2025
[17]

e1: Learning adaptive control of reasoning effort

Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, and Stefano Soatto. e1: Learning adaptive control of reasoning effort, 2025. URL https://arxiv.org/abs/2510.27042

work page arXiv 2025
[18]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

2022
[19]

Chain of thought monitorability: A new and fragile opportunity for ai safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...

work page arXiv 2025
[20]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwel...

work page Pith review arXiv 2023
[21]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

2021
[22]

Numinamath

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/num...

2024
[23]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

2021
[24]

Pause tokens strictly increase the expressivity of constant-depth transformers

Charles London and Varun Kanade. Pause tokens strictly increase the expressivity of constant-depth transformers. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=eG5oh8l1WZ

2025
[25]

Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning

Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, and Shuangzhi Wu. Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning, 2025. URL https://arxiv.org/abs/2505.11896

work page arXiv 2025
[26]

Exact expressive power of transformers with padding

William Merrill and Ashish Sabharwal. Exact expressive power of transformers with padding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=O1abxStFcy

2025
[27]

Learning to compress prompts with gist tokens

Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=2DtxPCL3T5

2023
[28]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Lang...

work page doi:10.18653/v1/2025.emnlp-main.1025 2025
[29]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shan...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925

work page internal anchor Pith review arXiv 2025
[31]

Jacob Pfau, William Merrill, and Samuel R. Bowman. Let's think dot by dot: Hidden computation in transformer language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=NikbrdtYvG

2024
[32]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

2024
[33]

Language modeling with learned meta-tokens

Alok N. Shah, Khush Gupta, Keshav Ramji, and Pratik Chaudhari. Language modeling with learned meta-tokens, 2025. URL https://arxiv.org/abs/2509.16278

work page arXiv 2025
[34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

work page internal anchor Pith review arXiv 2024
[35]

Hybridcot: Interleaving latent and text chain-of-thought for efficient reasoning, 2026

Shannon Zejiang Shen, Rulin Shao, Chenyu Wang, Songlin Yang, Vincent-Pierre Berges, Gargi Ghosh, Pang Wei Koh, Luke Zettlemoyer, Yoon Kim, Jason E Weston, David Sontag, and Wen tau Yih. Hybridcot: Interleaving latent and text chain-of-thought for efficient reasoning, 2026. URL https://openreview.net/forum?id=4mfGbMzTwu

2026
[36]

CODI: Compressing chain-of-thought into continuous space via self-distillation

Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 677–693, Suzhou, C...

work page doi:10.18653/v1/2025.emnlp-main.36 2025
[37]

Token assorted: Mixing latent and text tokens for improved language model reasoning

DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, and Qinqing Zheng. Token assorted: Mixing latent and text tokens for improved language model reasoning. In Proceedings of the 42nd International Conference on Machine Learning, 2025

2025
[38]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=bzs4uPLXvi

2023
[39]

System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts

Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=MNduv07wAu

2025
[40]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

2022
[41]

Tokenskip: Controlling chain-of-thought compression for efficient reasoning

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controlling chain-of-thought compression for efficient reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

2025
[42]

SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs

Yige Xu, Xu Guo, Zhiwei Zeng, and Chunyan Miao. SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23336–23351, Vienna, Austria, J...

work page doi:10.18653/v1/2025.acl-long.1137 2025
[43]

From long to lean: Performance-aware and adaptive chain-of-thought compression via multi-round refinement

JianZhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, and Buzhou Tang. From long to lean: Performance-aware and adaptive chain-of-thought compression via multi-round refinement. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

2025
[44]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review arXiv 2025
[45]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proce...

work page doi:10.18653/v1/d18-1259 2018
[46]

Lightthinker: Thinking step-by-step compression

Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Da, Da Zheng, Huajun Chen, and Ningyu Zhang. Lightthinker: Thinking step-by-step compression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025a

2025
[47]

Soft thinking: Unlocking the reasoning potential of llms in continuous concept space

Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space, 2025b. URL https://arxiv.org/abs/2505.15778

work page arXiv 2025