
arxiv: 2412.13171 · v1 · pith:NRUDAJ65new · submitted 2024-12-17 · 💻 cs.CL

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Pith reviewed 2026-05-17 04:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords chain-of-thought · reasoning · language models · contemplation tokens · compressed representations · efficient decoding · decoder models

The pith

Language models can reason more accurately by generating compressed continuous tokens that stand in for full reasoning chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Compressed Chain-of-Thought as a method that lets decoder language models produce variable-length sequences of continuous contemplation tokens during inference. These tokens are designed to act as dense, contentful representations of what would otherwise be explicit step-by-step reasoning written in words. According to the abstract, experiments illustrate that reasoning over these representations yields accuracy gains on tasks that benefit from chain-of-thought prompting. The gains scale with the number of tokens generated, so users can request more or less extra computation as needed. The method can be applied to existing off-the-shelf decoder language models.

Core claim

Compressed Chain-of-Thought generates contentful and continuous contemplation tokens of variable sequence length that serve as compressed representations of explicit reasoning chains; when decoder language models perform additional reasoning over these dense representations, accuracy improves, and the amount of improvement can be controlled on demand simply by varying the number of contemplation tokens produced.

What carries the argument

The generation of variable-length continuous contemplation tokens as compressed representations of explicit reasoning chains, which allow extra computation inside the model without producing discrete text output.
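As a rough illustration of what "continuous contemplation tokens" could mean mechanically, the sketch below runs a toy recurrent map in which each step's output vector is fed back as the next input vector, with no discrete token sampled in between. Everything in it is an assumption made for illustration: `make_weights`, `toy_layer`, and the vector dimension are hypothetical stand-ins, not the paper's model or training procedure. Only the caller-chosen token count `k` mirrors the claim above.

```python
import math
import random


def make_weights(dim, seed=0):
    """Hypothetical fixed weights standing in for a trained decoder layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
            for _ in range(dim)]


def toy_layer(weights, vec):
    """One toy 'forward pass': a linear map followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def generate_contemplation_tokens(weights, start_vec, k):
    """Produce k continuous vectors, feeding each output back as the next input.

    No vocabulary lookup or sampling happens between steps, which is the
    sense in which the tokens are 'continuous' in this sketch.
    """
    tokens = []
    current = start_vec
    for _ in range(k):
        current = toy_layer(weights, current)
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    dim = 4
    weights = make_weights(dim)
    query_state = [0.5, -0.2, 0.1, 0.9]  # hypothetical encoding of the query
    for k in (1, 3, 6):  # the caller picks how much extra computation to spend
        tokens = generate_contemplation_tokens(weights, query_state, k)
        print(k, len(tokens), len(tokens[0]))
```

The loop length `k` is the only knob here; in a real system the vectors would come from a trained decoder and would be consumed by the answer-decoding step rather than printed.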

Load-bearing premise

The continuous contemplation tokens actually encode and preserve the semantic content of explicit reasoning chains instead of acting mainly as extra learned parameters whose benefit is unrelated to interpretable reasoning.

What would settle it

Replace the generated contemplation tokens at inference time with random vectors of the same length and dimensionality. If the accuracy gains persist, the tokens are not carrying semantic reasoning content; if the gains disappear, the specific content of the tokens is doing the work.
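A random-replacement test of this kind can be outlined as a small harness. This is a hypothetical sketch, not code from the paper: `evaluate` is a caller-supplied function (assumed here) that returns 1 when the model answers correctly given a list of contemplation vectors and 0 otherwise, and Gaussian noise is one arbitrary choice of "random vector".

```python
import random


def random_like(tokens, rng):
    """Random vectors with the same count and dimensionality as `tokens`."""
    return [[rng.gauss(0.0, 1.0) for _ in vec] for vec in tokens]


def ablation_gap(examples, evaluate, seed=0):
    """Compare accuracy with generated tokens against random replacements.

    `examples` is a list of (tokens, label) pairs; `evaluate(tokens, label)`
    returns 1 if the answer is correct given those tokens, else 0.
    Returns (accuracy_generated, accuracy_random, difference).
    """
    rng = random.Random(seed)
    real = sum(evaluate(tokens, label) for tokens, label in examples)
    rand = sum(evaluate(random_like(tokens, rng), label)
               for tokens, label in examples)
    n = len(examples)
    return real / n, rand / n, (real - rand) / n
```

The third return value is the share of accuracy that depends on the specific vectors rather than on their count and shape. A complete study would also report the baseline with no contemplation tokens and repeat the comparison over several seeds.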

Original abstract

Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra computation. Prior work has considered fixed-length sequences drawn from a discrete set of embeddings as contemplation tokens. Here we propose Compressed Chain-of-Thought (CCoT), a framework to generate contentful and continuous contemplation tokens of variable sequence length. The generated contemplation tokens are compressed representations of explicit reasoning chains, and our method can be applied to off-the-shelf decoder language models. Through experiments, we illustrate how CCoT enables additional reasoning over dense contentful representations to achieve corresponding improvements in accuracy. Moreover, the reasoning improvements can be adaptively modified on demand by controlling the number of contemplation tokens generated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Compressed Chain-of-Thought (CCoT), a framework for generating variable-length continuous contemplation tokens as compressed representations of explicit reasoning chains. These tokens are intended to enable additional reasoning over dense contentful representations in off-the-shelf decoder language models, yielding accuracy gains that can be adaptively controlled by varying the number of tokens generated.

Significance. If the central claims are substantiated, the work could advance efficient inference-time reasoning in language models by replacing explicit discrete chains with dense continuous representations, offering potential latency benefits and adaptive control. The extension from fixed discrete contemplation tokens to continuous variable-length ones is a clear technical step. However, the significance depends heavily on evidence that the tokens preserve specific semantic content from reasoning steps rather than providing generic extra computation.

major comments (2)
  1. [Abstract] The claim that experiments demonstrate accuracy improvements from 'additional reasoning over dense contentful representations' is unsupported by any quantitative results, baselines, error bars, or details on training/decoding of the continuous tokens; without these, the central empirical claim cannot be evaluated.
  2. [Abstract / Methods] The framing that contemplation tokens are 'compressed representations of explicit reasoning chains' and 'contentful' requires load-bearing evidence such as ablations separating semantic content from token count/length effects or probing/reconstruction experiments linking individual tokens to specific reasoning steps; absent this, gains are consistent with standard inference compute scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and commit to revisions that strengthen the presentation of our empirical results and the evidence for the semantic content of the contemplation tokens.

Point-by-point responses
  1. Referee: [Abstract] The claim that experiments demonstrate accuracy improvements from 'additional reasoning over dense contentful representations' is unsupported by any quantitative results, baselines, error bars, or details on training/decoding of the continuous tokens; without these, the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract is too high-level. The full manuscript reports concrete results on standard reasoning benchmarks (including accuracy deltas versus CoT and other baselines, with standard error bars across multiple random seeds) and describes the training objective plus decoding procedure for the continuous tokens. We will revise the abstract to include a concise summary of these quantitative findings and methodological details so that the central claim can be evaluated directly from the abstract.

    revision: yes

  2. Referee: [Abstract / Methods] The framing that contemplation tokens are 'compressed representations of explicit reasoning chains' and 'contentful' requires load-bearing evidence such as ablations separating semantic content from token count/length effects or probing/reconstruction experiments linking individual tokens to specific reasoning steps; absent this, gains are consistent with standard inference compute scaling.

    Authors: We acknowledge that the current manuscript does not contain explicit probing or reconstruction experiments that directly map individual tokens to specific reasoning steps. However, our experiments already include controls that vary the number of contemplation tokens while holding total inference compute roughly constant and compare against both standard CoT and fixed-length discrete contemplation baselines; the observed accuracy gains exceed what is explained by additional compute alone. To address the referee's concern more directly, we will add an ablation that replaces the learned continuous tokens with random vectors of identical length and dimensionality, thereby isolating semantic content from mere length effects.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical training procedure with no derivation chain

Full rationale

The paper describes CCoT as a framework for generating variable-length continuous contemplation tokens from off-the-shelf decoder LMs, with the claim that these tokens serve as compressed representations of explicit reasoning chains. No equations, first-principles derivations, or closed-form predictions appear in the provided abstract or description. The approach is presented as a training/inference procedure whose benefits are illustrated through experiments on accuracy improvements, rather than any mathematical reduction that equates outputs to inputs by construction. Self-citations or ansatzes are not load-bearing in the given material, and the central claim does not reduce to a fitted parameter renamed as a prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that a decoder-only language model can be trained or prompted to emit semantically meaningful continuous vectors that function as compressed reasoning steps; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

pith-pipeline@v0.9.0 · 5435 in / 1200 out tokens · 28459 ms · 2026-05-17T04:44:18.665036+00:00 · methodology


Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning through Internalization

    cs.LG 2026-06 unverdicted novelty 7.0

    A simplified one-layer transformer provably learns parities first with explicit CoT supervision then internalizes to direct computation as CoT tokens are removed.

  2. DeepLatent: Think with Images via Parallel Latent Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

  3. Unlocking the Working Memory of Large Language Models for Latent Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    RiM trains LLMs to perform latent reasoning via fixed memory blocks processed in one forward pass using a two-stage curriculum, matching or exceeding prior latent methods on benchmarks.

  4. Training-Free Looped Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.

  5. On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.

  6. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  7. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  8. Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    cs.CL 2026-04 unverdicted novelty 7.0

    Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

  9. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  10. Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

    cs.AI 2025-10 unverdicted novelty 7.0

    CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.

  11. Latent Visual Reasoning

    cs.CV 2025-09 unverdicted novelty 7.0

    Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.

  12. CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

    cs.CL 2025-02 unverdicted novelty 7.0

    CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than ...

  13. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  14. CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts

    cs.CV 2026-06 unverdicted novelty 6.0

    CoLT replaces text-based chain-of-thought in MLLMs with 3-step latent thought chains supervised by a removable external decoder in forward and backward modes, yielding 10.1x faster inference on eight benchmarks.

  15. Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

    cs.LG 2026-06 unverdicted novelty 6.0

    LOTUS uses a looped padded Transformer with parallel cross-entropy supervision on gold CoT tokens to match explicit CoT performance at 3B parameters while reducing thought-phase latency 2.5x-6.9x.

  16. VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

    cs.CV 2026-06 unverdicted novelty 6.0

    VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inferenc...

  17. When LLMs Develop Languages: Symbolic Communication for Efficient Multi-Agent Reasoning

    cs.AI 2026-06 unverdicted novelty 6.0

    CLSR lets LLM agents evolve and route symbolic languages that reduce generated tokens by 3-6x versus chain-of-thought while keeping accuracy on benchmarks.

  18. Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

    cs.CL 2026-06 unverdicted novelty 6.0

    Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.

  19. Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

    cs.LG 2026-06 unverdicted novelty 6.0

    Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.

  20. Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

    cs.AI 2026-06 unverdicted novelty 6.0

    No-CoT 50% task-completion time horizons for frontier models have doubled yearly for six years, reaching over 3 minutes for GPT-5.5, with median projections of 7 minutes by 2028 and 25 minutes by 2030.

  21. MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

    cs.RO 2026-06 unverdicted novelty 6.0

    MPCoT improves long-horizon VLA performance on LIBERO and CALVIN by initializing M latent hypotheses, refining them over K steps, and aggregating via a reward-trained path scorer while preserving the original 8-step a...

  22. Adaptive Latent Agentic Reasoning

    cs.CL 2026-06 unverdicted novelty 6.0

    ALAR trains LLM agents to perform most reasoning in a latent space supervised by actions and escalates to explicit CoT only when needed, cutting tokens by up to 84.6% while preserving accuracy on search and tool-use b...

  23. Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

    cs.LG 2026-06 unverdicted novelty 6.0

    SpecFlow represents intermediate visual thoughts in fixed-size DCT space and uses classifier-free guidance to steer updates from textual thoughts, achieving up to 2.1x lower computation and KV cache costs.

  24. ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

    cs.LG 2026-05 unverdicted novelty 6.0

    ThinkSwitch uses iterative self-distillation with QLoRA and spherical weight interpolation to raise both instruct and thinking checkpoint accuracy on small AIME and PubMedQA sets using only 15 human prompts per domain.

  25. Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems

    cs.CR 2026-05 unverdicted novelty 6.0

    Latent interventions can reactivate attack effects in clean executions of latent-based multi-agent systems, degrading performance especially via inter-agent KV-cache handoffs.

  26. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  27. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  28. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  29. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  30. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  31. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.

  32. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  33. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  34. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  35. DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning

    cs.CL 2026-07 unverdicted novelty 5.0

    DiscoLoop adds a discrete embedding channel to looped transformers to fix representational misalignment in two-hop reasoning, yielding near-perfect accuracy on synthetic tasks and better pretraining loss on real data.

  36. Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

    cs.AI 2026-06 unverdicted novelty 5.0

    Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.

  37. Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

    cs.CL 2026-06 unverdicted novelty 5.0

    Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.

  38. Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token ...

  39. Generative Spatiotemporal Intent Sequence Recommendation via Implicit Reasoning in Amap

    cs.IR 2026-05 unverdicted novelty 5.0

    GPlan compresses LLM reasoning into small models via Progressive Implicit CoT Distillation and Spatiotemporal Counterfactual DPO to generate logically coherent and physically executable intent sequences for recommendation.

  40. Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    STARS trains looped language models with Jacobian spectral radius regularization and random loop sampling to drive latent states toward asymptotically stable fixed points, yielding reliable test-time scaling on arithm...

  41. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.

  42. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  43. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  44. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  45. ConFu: Contemplate the Future for Better Speculative Sampling

    cs.CL 2026-03 unverdicted novelty 5.0

    ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.

  46. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  47. Efficient Reasoning with Hidden Thinking

    cs.CL 2025-01 unverdicted novelty 5.0

    Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.

  48. The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

    cs.CL 2026-06 unverdicted novelty 4.0

    A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.

  49. Token Economics for LLM Agents: A Dual-View Study from Computing and Economics

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...
