Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Benjamin Van Durme; Jeffrey Cheng

arxiv: 2412.13171 · v1 · pith:NRUDAJ65new · submitted 2024-12-17 · 💻 cs.CL

Jeffrey Cheng, Benjamin Van Durme

Pith reviewed 2026-05-17 04:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords: chain-of-thought, reasoning, language models, contemplation tokens, compressed representations, efficient decoding, decoder models

The pith

Language models can reason more accurately by generating compressed continuous tokens that stand in for full reasoning chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Compressed Chain-of-Thought (CCoT), a method that lets decoder language models produce variable-length sequences of continuous contemplation tokens during inference. These tokens are designed to act as dense, contentful representations of what would otherwise be explicit step-by-step reasoning written in words. The experiments are presented as showing that reasoning over these representations yields accuracy gains on tasks that benefit from chain-of-thought prompting. The size of the gain can be adjusted by changing the number of tokens generated, so users can request more or less extra computation as needed. The method can be applied to existing off-the-shelf decoder language models.

Core claim

Compressed Chain-of-Thought generates contentful and continuous contemplation tokens of variable sequence length that serve as compressed representations of explicit reasoning chains; when decoder language models perform additional reasoning over these dense representations, accuracy improves, and the amount of improvement can be controlled on demand simply by varying the number of contemplation tokens produced.

What carries the argument

The generation of variable-length continuous contemplation tokens as compressed representations of explicit reasoning chains, which allow extra computation inside the model without producing discrete text output.

Load-bearing premise

The continuous contemplation tokens actually encode and preserve the semantic content of explicit reasoning chains instead of acting mainly as extra learned parameters whose benefit is unrelated to interpretable reasoning.

What would settle it

Replace the generated contemplation tokens at inference time with random vectors of the same length. If the accuracy gains persist, the tokens are not carrying semantic reasoning content; if the gains disappear, the content of the tokens is doing the work.
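The random-vector control can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: `random_like`, `ablation_gap`, and the `answer_fn(question, tokens)` callable are hypothetical names, and matching the per-vector norm of the learned tokens is an added assumption so that the control differs in content rather than in scale.

```python
import numpy as np

def random_like(tokens: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random vectors with the same shape and per-vector L2 norm as `tokens`."""
    noise = rng.standard_normal(tokens.shape)
    noise /= np.linalg.norm(noise, axis=-1, keepdims=True)
    return noise * np.linalg.norm(tokens, axis=-1, keepdims=True)

def ablation_gap(answer_fn, questions, tokens_per_question, labels, seed=0):
    """Accuracy with learned tokens minus accuracy with norm-matched random tokens.

    `answer_fn(question, tokens)` is a hypothetical stand-in for running the
    model on a question conditioned on a (k, d) array of contemplation tokens.
    """
    rng = np.random.default_rng(seed)
    learned = [answer_fn(q, t) for q, t in zip(questions, tokens_per_question)]
    control = [answer_fn(q, random_like(t, rng))
               for q, t in zip(questions, tokens_per_question)]

    def accuracy(preds):
        return float(np.mean([p == y for p, y in zip(preds, labels)]))

    return accuracy(learned) - accuracy(control)
```

The control only separates content from length if the random vectors match the learned ones in count and scale, which is why the sketch keeps the shape and norms fixed and changes nothing else.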

Original abstract

Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra computation. Prior work has considered fixed-length sequences drawn from a discrete set of embeddings as contemplation tokens. Here we propose Compressed Chain-of-Thought (CCoT), a framework to generate contentful and continuous contemplation tokens of variable sequence length. The generated contemplation tokens are compressed representations of explicit reasoning chains, and our method can be applied to off-the-shelf decoder language models. Through experiments, we illustrate how CCoT enables additional reasoning over dense contentful representations to achieve corresponding improvements in accuracy. Moreover, the reasoning improvements can be adaptively modified on demand by controlling the number of contemplation tokens generated.
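The on-demand control described in the abstract comes down to choosing how many contemplation tokens to decode. A minimal sketch, assuming the budget is set by a compression ratio r applied to the length m of the explicit reasoning chain: the ceiling rule k = ceil(r * m) and the function name `contemplation_budget` are assumptions made for illustration, not details stated on this page.

```python
import math

def contemplation_budget(chain_length: int, ratio: float) -> int:
    """Number of contemplation tokens for a chain of `chain_length` tokens.

    Assumed rule: k = ceil(ratio * chain_length). The product is rounded to
    9 decimals first so float noise (e.g. 0.1 * 30 = 3.0000000000000004)
    does not push the ceiling up by one.
    """
    if chain_length < 0:
        raise ValueError("chain_length must be non-negative")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return math.ceil(round(ratio * chain_length, 9))

# A smaller ratio means fewer extra decoding steps; a larger one buys more compute.
for r in (0.05, 0.10):
    print(r, contemplation_budget(200, r))
```

Under this assumed rule, a 200-token reasoning chain maps to 10 contemplation tokens at r = 0.05 and 20 at r = 0.10.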

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CCoT swaps fixed discrete contemplation tokens for variable-length continuous ones to give a control knob on extra test-time compute, but the evidence that these are truly compressed reasoning content is still thin.

The main point here is a straightforward extension of contemplation tokens: instead of fixed-length discrete embeddings, they generate continuous dense vectors whose count can vary during inference. This is meant to let decoder-only models do extra reasoning steps in a compressed form while keeping the ability to dial the amount of compute up or down on demand. The framing is that these tokens act as compressed versions of explicit reasoning chains, and the paper says it works on off-the-shelf models without major changes.

That variable-length control is the practical piece that stands out if it works cleanly. It gives a simple way to trade accuracy for latency that prior fixed-token approaches did not offer as directly. The experiments are described as showing corresponding accuracy gains, which would be useful for efficiency work if the numbers hold with proper baselines.

The soft spots sit in the support for the central claim. The abstract gives no quantitative results, no error bars, and no details on how the continuous tokens are trained or decoded, so the accuracy improvements cannot be checked yet. More critically, nothing in the description shows that the tokens actually preserve semantic content from reasoning chains rather than functioning as extra learned parameters that simply increase forward passes. The stress-test concern is on target here: without ablations that isolate content from length or any probing that ties individual tokens to specific reasoning steps, the gains are explainable by ordinary compute scaling.

This is for researchers who care about test-time scaling tricks for reasoning models. A reader already working on efficient CoT variants would pick up the variable-length idea and the off-the-shelf applicability. It deserves a serious referee to examine the full methods, training procedure, and any ablations that address the content question. I would send it to review rather than desk reject, but with a note to strengthen the evidence that the representations are doing more than adding generic computation steps.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Compressed Chain-of-Thought (CCoT), a framework for generating variable-length continuous contemplation tokens as compressed representations of explicit reasoning chains. These tokens are intended to enable additional reasoning over dense contentful representations in off-the-shelf decoder language models, yielding accuracy gains that can be adaptively controlled by varying the number of tokens generated.

Significance. If the central claims are substantiated, the work could advance efficient inference-time reasoning in language models by replacing explicit discrete chains with dense continuous representations, offering potential latency benefits and adaptive control. The extension from fixed discrete contemplation tokens to continuous variable-length ones is a clear technical step. However, the significance depends heavily on evidence that the tokens preserve specific semantic content from reasoning steps rather than providing generic extra computation.

major comments (2)

[Abstract] Abstract: the claim that experiments demonstrate accuracy improvements from 'additional reasoning over dense contentful representations' is unsupported by any quantitative results, baselines, error bars, or details on training/decoding of the continuous tokens; without these, the central empirical claim cannot be evaluated.
[Abstract / Methods] The framing that contemplation tokens are 'compressed representations of explicit reasoning chains' and 'contentful' requires load-bearing evidence such as ablations separating semantic content from token count/length effects or probing/reconstruction experiments linking individual tokens to specific reasoning steps; absent this, gains are consistent with standard inference compute scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and commit to revisions that strengthen the presentation of our empirical results and the evidence for the semantic content of the contemplation tokens.

Referee: [Abstract] Abstract: the claim that experiments demonstrate accuracy improvements from 'additional reasoning over dense contentful representations' is unsupported by any quantitative results, baselines, error bars, or details on training/decoding of the continuous tokens; without these, the central empirical claim cannot be evaluated.

Authors: We agree that the abstract is too high-level. The full manuscript reports concrete results on standard reasoning benchmarks (including accuracy deltas versus CoT and other baselines, with standard error bars across multiple random seeds) and describes the training objective plus decoding procedure for the continuous tokens. We will revise the abstract to include a concise summary of these quantitative findings and methodological details so that the central claim can be evaluated directly from the abstract.

revision: yes

Referee: [Abstract / Methods] The framing that contemplation tokens are 'compressed representations of explicit reasoning chains' and 'contentful' requires load-bearing evidence such as ablations separating semantic content from token count/length effects or probing/reconstruction experiments linking individual tokens to specific reasoning steps; absent this, gains are consistent with standard inference compute scaling.

Authors: We acknowledge that the current manuscript does not contain explicit probing or reconstruction experiments that directly map individual tokens to specific reasoning steps. However, our experiments already include controls that vary the number of contemplation tokens while holding total inference compute roughly constant and compare against both standard CoT and fixed-length discrete contemplation baselines; the observed accuracy gains exceed what is explained by additional compute alone. To address the referee's concern more directly, we will add an ablation that replaces the learned continuous tokens with random vectors of identical length and dimensionality, thereby isolating semantic content from mere length effects.

revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical training procedure with no derivation chain

The paper describes CCoT as a framework for generating variable-length continuous contemplation tokens from off-the-shelf decoder LMs, with the claim that these tokens serve as compressed representations of explicit reasoning chains. No equations, first-principles derivations, or closed-form predictions appear in the provided abstract or description. The approach is presented as a training/inference procedure whose benefits are illustrated through experiments on accuracy improvements, rather than any mathematical reduction that equates outputs to inputs by construction. Self-citations or ansatzes are not load-bearing in the given material, and the central claim does not reduce to a fitted parameter renamed as a prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that a decoder-only language model can be trained or prompted to emit semantically meaningful continuous vectors that function as compressed reasoning steps; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

pith-pipeline@v0.9.0 · 5435 in / 1200 out tokens · 28459 ms · 2026-05-17T04:44:18.665036+00:00 · methodology

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning through Internalization
cs.LG 2026-06 unverdicted novelty 7.0

A simplified one-layer transformer provably learns parities first with explicit CoT supervision then internalizes to direct computation as CoT tokens are removed.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
Unlocking the Working Memory of Large Language Models for Latent Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.
Training-Free Looped Transformers
cs.LG 2026-05 unverdicted novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
cs.LG 2026-05 unverdicted novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
cs.CL 2026-05 unverdicted novelty 7.0

LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
cs.CL 2026-04 unverdicted novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
cs.CV 2026-03 unverdicted novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
cs.AI 2025-10 unverdicted novelty 7.0

CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
Latent Visual Reasoning
cs.CV 2025-09 unverdicted novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
cs.CL 2025-02 unverdicted novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than ...
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts
cs.CV 2026-06 unverdicted novelty 6.0

CoLT replaces text-based chain-of-thought in MLLMs with 3-step latent thought chains supervised by a removable external decoder in forward and backward modes, yielding 10.1x faster inference on eight benchmarks.
Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers
cs.LG 2026-06 unverdicted novelty 6.0

LOTUS uses a looped padded Transformer with parallel cross-entropy supervision on gold CoT tokens to match explicit CoT performance at 3B parameters while reducing thought-phase latency 2.5x-6.9x.
VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context
cs.CV 2026-06 unverdicted novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inferenc...
When LLMs Develop Languages: Symbolic Communication for Efficient Multi-Agent Reasoning
cs.AI 2026-06 unverdicted novelty 6.0

CLSR lets LLM agents evolve and route symbolic languages that reduce generated tokens by 3-6x versus chain-of-thought while keeping accuracy on benchmarks.
Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
cs.CL 2026-06 unverdicted novelty 6.0

Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
cs.LG 2026-06 unverdicted novelty 6.0

Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
cs.AI 2026-06 unverdicted novelty 6.0

No-CoT 50% task-completion time horizons for frontier models have doubled yearly for six years, reaching over 3 minutes for GPT-5.5, with median projections of 7 minutes by 2028 and 25 minutes by 2030.
MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action
cs.RO 2026-06 unverdicted novelty 6.0

MPCoT improves long-horizon VLA performance on LIBERO and CALVIN by initializing M latent hypotheses, refining them over K steps, and aggregating via a reward-trained path scorer while preserving the original 8-step a...
Adaptive Latent Agentic Reasoning
cs.CL 2026-06 unverdicted novelty 6.0

ALAR trains LLM agents to perform most reasoning in a latent space supervised by actions and escalates to explicit CoT only when needed, cutting tokens by up to 84.6% while preserving accuracy on search and tool-use b...
Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning
cs.LG 2026-06 unverdicted novelty 6.0

SpecFlow represents intermediate visual thoughts in fixed-size DCT space and uses classifier-free guidance to steer updates from textual thoughts, achieving up to 2.1x lower computation and KV cache costs.
ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks
cs.LG 2026-05 unverdicted novelty 6.0

ThinkSwitch uses iterative self-distillation with QLoRA and spherical weight interpolation to raise both instruct and thinking checkpoint accuracy on small AIME and PubMedQA sets using only 15 human prompts per domain.
Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems
cs.CR 2026-05 unverdicted novelty 6.0

Latent interventions can reactivate attack effects in clean executions of latent-based multi-agent systems, degrading performance especially via inter-agent KV-cache handoffs.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0

OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
LEPO: Latent Reasoning Policy Optimization for Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
SeLaR: Selective Latent Reasoning in Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
LightThinker++: From Reasoning Compression to Memory Management
cs.CL 2026-04 unverdicted novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning
cs.CL 2026-07 unverdicted novelty 5.0

DiscoLoop adds a discrete embedding channel to looped transformers to fix representational misalignment in two-hop reasoning, yielding near-perfect accuracy on synthetic tasks and better pretraining loss on real data.
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
cs.AI 2026-06 unverdicted novelty 5.0

Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.
Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces
cs.CL 2026-06 unverdicted novelty 5.0

Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs
cs.CL 2026-05 unverdicted novelty 5.0

HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token ...
Generative Spatiotemporal Intent Sequence Recommendation via Implicit Reasoning in Amap
cs.IR 2026-05 unverdicted novelty 5.0

GPlan compresses LLM reasoning into small models via Progressive Implicit CoT Distillation and Spatiotemporal Counterfactual DPO to generate logically coherent and physically executable intent sequences for recommendation.
Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models
cs.LG 2026-05 unverdicted novelty 5.0

STARS trains looped language models with Jacobian spectral radius regularization and random loop sampling to drive latent states toward asymptotically stable fixed points, yielding reliable test-time scaling on arithm...
LEPO: Latent Reasoning Policy Optimization for Large Language Models
cs.LG 2026-04 unverdicted novelty 5.0

LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 5.0

A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
cs.CV 2026-04 unverdicted novelty 5.0

MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
ConFu: Contemplate the Future for Better Speculative Sampling
cs.CL 2026-03 unverdicted novelty 5.0

ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL 2025-03 accept novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
Efficient Reasoning with Hidden Thinking
cs.CL 2025-01 unverdicted novelty 5.0

Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.
The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
cs.CL 2026-06 unverdicted novelty 4.0

A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 43 Pith papers · 10 internal anchors

[1]

Memory transformer

URL https://arxiv. org/abs/2006.11527. Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems,

work page arXiv 2006
[2]

Training Verifiers to Solve Math Word Problems

URL https://arxiv. org/abs/2110.14168. Deng, Y ., Prasad, K., Fernandez, R., Smolensky, P., Chaud- hary, V ., and Shieber, S. Implicit chain of thought rea- soning via knowledge distillation,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Implicit chain of thought reasoning via knowledge distillation.arXiv preprint arXiv:2311.01460, 2023

URL https: //arxiv.org/abs/2311.01460. Deng, Y ., Choi, Y ., and Shieber, S. From explicit cot to implicit cot: Learning to internalize cot step by step,

work page arXiv
[4]

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

URL https://arxiv.org/abs/2405.14838. Ge, T., Hu, J., Wang, L., Wang, X., Chen, S.-Q., and Wei, F. In-context autoencoder for context compression in a large language model,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

In-context autoencoder for context compression in a large language model

URL https://arxiv. org/abs/2307.06945. Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., and Nagarajan, V . Think before you speak: Training language models with pause tokens,

work page arXiv
[6]

arXiv preprint arXiv:2310.02226 , year =

URL https: //arxiv.org/abs/2310.02226. Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space,

work page arXiv
[7]

Training Large Language Models to Reason in a Continuous Latent Space

URL https:// arxiv.org/abs/2412.06769. Herel, D. and Mikolov, T. Thinking tokens for language modeling,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

URL https://arxiv.org/abs/ 2405.08644. Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models,

work page arXiv
[9]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv. org/abs/2106.09685. Jiang, H., Wu, Q., Lin, C.-Y ., Yang, Y ., and Qiu, L. Llmlingua: Compressing prompts for accelerated infer- ence of large language models,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Llmlingua: Compressing prompts for accelerated inference of large language models

URL https: //arxiv.org/abs/2310.05736. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y ., and Iwasawa, Y . Large language models are zero-shot reasoners,

work page arXiv
[11]

Large Language Models are Zero-Shot Reasoners

URL https://arxiv.org/abs/2205.11916. Kou, S., Hu, L., He, Z., Deng, Z., and Zhang, H. Cllms: Consistency large language models,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

URL https: //arxiv.org/abs/2403.00835. 8 Compressed Chain of Thought Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Deni- son, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., Luko ˇsi¯ut˙e, K., Nguyen, K., Cheng, N., Joseph, N., Schiefer, N., Rausch, O., Larson, R., McCan- dlish, S., Kundu, S., Kadavath, S., Yang, S., Henighan, T....

work page arXiv
[13]

Measuring Faithfulness in Chain-of-Thought Reasoning

URL https://arxiv.org/abs/ 2307.13702. Liu, H., Sferrazza, C., and Abbeel, P. Chain of hind- sight aligns language models with feedback,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Chain of

URL https://arxiv.org/abs/2302.02676. Liu, Y ., Li, H., Cheng, Y ., Ray, S., Huang, Y ., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Maire, M., Hoffmann, H., Holtzman, A., and Jiang, J. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, pp. 3...

work page arXiv 2024
[15]

Mellette, Alex Forencich, Rukshani Athapathu, Alex C

Associa- tion for Computing Machinery. ISBN 9798400706141. doi: 10.1145/3651890.3672274. URL https://doi. org/10.1145/3651890.3672274. Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., and Wang, Y . Skeleton-of-thought: Prompting llms for efficient par- allel generation,

work page doi:10.1145/3651890.3672274
[16]

arXiv preprint arXiv:2307.15337 , year=

URL https://arxiv.org/ abs/2307.15337. Pfau, J., Merrill, W., and Bowman, S. R. Let’s think dot by dot: Hidden computation in transformer language mod- els,

work page arXiv
[17]

Llama 2: Open Foundation and Fine-Tuned Chat Models

URL https://arxiv.org/abs/2307.09288. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need,

[18]

Attention Is All You Need

URL https://arxiv.org/abs/1706.03762. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models,

[19]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

URL https://arxiv.org/abs/2201.11903. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models,

[20]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

URL https://arxiv.org/abs/2305.10601. Zhang, H., Liu, Z., Zhao, Y., Zheng, J., Zhuang, C., Gu, J., and Chen, G. Fast chain-of-thought: A glance of future from parallel decoding leads to answers faster,

[21]

Fast chain-of-thought: A glance of future from parallel decoding leads to answers faster

URL https://arxiv.org/abs/2311.08263.

A. Varying the autoregressive layer

Our method CCOT autoregressively generates contemplation tokens by using the hidden state at the lth layer at index i as the input embedding at index i +
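The feedback loop described in Appendix A can be illustrated with a toy sketch. It assumes the truncated sentence ends with "i + 1", that is, the layer-l hidden state at the last position is appended as the next input embedding. The names `toy_layer`, `hidden_at_layer`, and `generate_contemplation` are hypothetical, and the "layers" here are a stand-in (a causal mean followed by tanh), not the authors' model or training setup.

```python
import math


def toy_layer(states, weight):
    """Toy stand-in for a decoder layer: causal mean over positions, then tanh."""
    dim = len(states[0])
    out = []
    for i in range(len(states)):
        prefix = states[: i + 1]  # position i only sees positions 0..i
        mean = [sum(vec[d] for vec in prefix) / len(prefix) for d in range(dim)]
        out.append([math.tanh(weight * x) for x in mean])
    return out


def hidden_at_layer(embeddings, weights, layer):
    """Apply the first `layer` toy layers; layer=0 returns the embeddings unchanged."""
    states = embeddings
    for w in weights[:layer]:
        states = toy_layer(states, w)
    return states


def generate_contemplation(prompt_embeddings, weights, layer, num_tokens):
    """Autoregressively produce `num_tokens` continuous vectors.

    At each step, the layer-`layer` hidden state at the last index i is used
    as the input embedding at index i + 1 (assumed reading of the source).
    """
    sequence = [list(v) for v in prompt_embeddings]
    generated = []
    for _ in range(num_tokens):
        hidden = hidden_at_layer(sequence, weights, layer)
        next_embedding = hidden[-1]
        sequence.append(next_embedding)
        generated.append(next_embedding)
    return generated


if __name__ == "__main__":
    prompt = [[0.1, -0.2], [0.3, 0.4]]
    tokens = generate_contemplation(prompt, weights=[0.5, 1.0, 1.5], layer=2, num_tokens=3)
    for t in tokens:
        print(t)
```

Changing `layer` in this sketch changes which intermediate representation is fed back, which is the quantity the appendix varies; no discrete token is sampled at any step.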

[22]


Accuracy on GSM8K with our method CCOT with a compression ratio of r = 0.05 when varying the autoregressive layer l. NONE refers to the baseline where no contemplation tokens are decoded during inference.

B. Further Theoretical Considerations

In this section, we formalize the two insights outlined in Section 6.2. We note that an analysis of the enhanced...

[23]


We formally introduce the new class of problems and outline the assumptions made by Goyal et al. (2024) below. Assumption B.1. (structure of underlying task) Assume a vocabulary V and an embedding dimension of d. Let $\circ$ be a generic 2-ary operator on the embedding space $\mathbb{R}^d$. For a given input length N, define the class of functions $F_{M,K}$ to be the set of all ...


