Efficient Reasoning with Hidden Thinking

Jiuxiang Gu; Pu Zhao; Xiangxi Shi; Xuan Shen; Yanzhi Wang; Yizhou Wang; Yufa Zhou

arXiv: 2501.19201 · v2 · submitted 2025-01-31 · cs.CL · cs.AI · cs.LG

Xuan Shen, Yizhou Wang, Yufa Zhou, Xiangxi Shi, Pu Zhao, Yanzhi Wang, Jiuxiang Gu

Pith reviewed 2026-05-23 04:43 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords chain-of-thought compression · hidden thinking tokens · mutual information · multimodal large language models · reasoning efficiency · adaptive interpreter · zero-shot accuracy

The pith

Compressing chain-of-thought into hidden tokens preserves reasoning accuracy when mutual information is retained.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Heima, a framework that condenses lengthy textual chain-of-thought reasoning in multimodal models into a small set of abstract thinking tokens. A theoretical analysis quantifies the information gap from compression and shows that reasoning capability holds when non-trivial mutual information remains between the tokens and the original CoT. An adaptive interpreter reconstructs variable-length text from the tokens, and experiments on reasoning benchmarks confirm maintained or improved zero-shot accuracy with coherent reconstructed steps.

Core claim

Reasoning capability is preserved when non-trivial mutual information is retained between the compressed thinking tokens and the original CoT; experiments show maintained or better zero-shot accuracy.

What carries the argument

The Heima compression framework that condenses CoTs into abstract thinking tokens, with the information-theoretic gap quantified via mutual information retention.
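To make "mutual information retention" concrete, here is a minimal sketch of how mutual information between a compressed token state and the original CoT could be computed from a joint distribution. The three toy distributions and the names `independent`, `noisy_copy`, and `exact_copy` are illustrative assumptions, not quantities from the paper, and the paper's own estimator may differ.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint distribution {(z, c): prob}."""
    pz, pc = {}, {}
    for (z, c), p in joint.items():
        pz[z] = pz.get(z, 0.0) + p  # marginal over token states
        pc[c] = pc.get(c, 0.0) + p  # marginal over CoT variants
    return sum(
        p * math.log2(p / (pz[z] * pc[c]))
        for (z, c), p in joint.items()
        if p > 0
    )

# Toy joints over (thinking-token state z, CoT variant c); all hypothetical.
independent = {(z, c): 0.25 for z in (0, 1) for c in (0, 1)}
noisy_copy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
exact_copy = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)  # no information retained
mi_noisy = mutual_information(noisy_copy)         # partial retention
mi_exact = mutual_information(exact_copy)         # full retention of a 1-bit CoT
```

In this toy setting, "non-trivial" retention corresponds to any value strictly above the independent case; how large it must be for reasoning to survive is exactly what the review questions.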

If this is right

Reasoning efficiency increases because verbose text is replaced by fewer tokens.
The adaptive interpreter can reconstruct coherent reasoning progress from the compressed tokens.
Zero-shot accuracy on diverse benchmarks stays the same or rises.
The approach supports development of scalable latent reasoning models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar compression may apply to non-reasoning tasks if mutual information can be maintained.
Internalizing reasoning into latent tokens could reduce output length across many model applications.
Varying the number of thinking tokens while tracking mutual information offers a direct way to test compression limits.
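The last suggestion can be sketched as a toy sweep. The setup below is an assumption for illustration only: eight equally likely CoT variants stand in for the original reasoning, and a deterministic bucketing function stands in for compression into a code with limited capacity. It relies on the standard fact that when the code Z is a deterministic function of the CoT C, I(Z; C) = H(Z).

```python
import math
from collections import Counter

def entropy_bits(values):
    """Empirical Shannon entropy in bits of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# Hypothetical setup: 8 equally likely CoT variants, labelled 0..7.
cots = list(range(8))

def compress(cot, num_codes):
    """Deterministic toy compressor: bucket a CoT label into num_codes codes."""
    return cot * num_codes // len(cots)

# The code is a deterministic function of the CoT, so H(Z | C) = 0
# and the retained mutual information I(Z; C) equals H(Z).
retained = {k: entropy_bits([compress(c, k) for c in cots]) for k in (1, 2, 4, 8)}

# Information gap at each capacity: H(C) minus the retained information.
gap = {k: entropy_bits(cots) - mi for k, mi in retained.items()}
```

Here the retained information grows as log2 of the code capacity until it reaches H(C) = 3 bits. A real experiment would replace the bucketing function with the trained encoder and pair each capacity with measured accuracy.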

Load-bearing premise

That the measured mutual information retained between tokens and original CoT is both accurate and sufficient to ensure task-critical reasoning steps survive the compression.

What would settle it

A benchmark result where high mutual information is retained yet zero-shot accuracy drops substantially on a reasoning task would falsify the preservation claim.
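As a rough sketch of how such a falsification check might be run over a results table: the record fields, thresholds, and benchmark names below are all hypothetical placeholders, not numbers from the paper.

```python
from dataclasses import dataclass

@dataclass
class Result:
    benchmark: str
    retained_mi_fraction: float  # assumed to be I(Z; C) / H(C), in [0, 1]
    baseline_accuracy: float     # accuracy with the full textual CoT
    compressed_accuracy: float   # accuracy with thinking tokens

def counterexamples(results, min_mi_fraction=0.8, max_accuracy_drop=0.05):
    """Return benchmarks where MI retention is high yet accuracy drops."""
    return [
        r.benchmark
        for r in results
        if r.retained_mi_fraction >= min_mi_fraction
        and r.baseline_accuracy - r.compressed_accuracy > max_accuracy_drop
    ]

# Placeholder rows for illustration only; these are not measured values.
results = [
    Result("bench_a", 0.9, 0.60, 0.61),  # high MI, accuracy holds
    Result("bench_b", 0.9, 0.60, 0.40),  # high MI, accuracy drops
    Result("bench_c", 0.3, 0.60, 0.40),  # low MI, so not a counterexample
]
flagged = counterexamples(results)
```

Any benchmark returned by `counterexamples` would count against the preservation claim. The thresholds are arbitrary and would need to be fixed before looking at the data.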

Figures

Figures reproduced from arXiv: 2501.19201 by Jiuxiang Gu, Pu Zhao, Xiangxi Shi, Xuan Shen, Yanzhi Wang, Yizhou Wang, Yufa Zhou.

**Figure 1.** Visualization of our whole framework. Image and question are fed to the Heima Encoder (MLLM) for reasoning encoding and final conclusion generation. Encoded thinking tokens and question are then fed to the Heima Decoder (LLM) for the CoT reconstruction. In the reconstructed reasoning process, the caption decoder successfully retrieves image information, describing it as "The image shows a sleek, modern spo…

**Figure 2.** Visualization of the CoTs progressive encoding process. <Thinking of Reasoning> for the reasoning stage. Note that different samples with varying values of $i$ share the same thinking token $\texttt{<CoT>}^{(k)}$ at the same stage $k$.

The updated dataset is then explained as follows: $\left\{ X_v^{(i)},\, X_q^{(i)},\, \{\texttt{<CoT>}^{(k)}\}_{k=1}^{K_i},\, Y_a^{(i)} \right\}_{i=1}^{N}$ (3). We then continue fine-tuning the model with the objective: $\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log$ …

**Figure 3.** Visualization of the training progress for Heima Decoder. The thinking token is taken from the s-th last hidden state of the Heima Encoder and replaces the embedding of the special token <CoT>(s).

…sentations, each with a unique thinking token. Each encoded CoT stage or each kind of thinking token requires one corresponding decoder or interpreter. Thus, to decode multiple thinking tokens, we need to train multiple de…

**Figure 5.** Results of evaluation by GPT-4o for assessing the average similarity score (1-5) between the reconstructed reasoning processes from the thinking tokens and the original CoTs.

4.3. Interpretability Analysis. To verify the effectiveness of hidden representation encoding and improve interpretability of the framework, we evaluate the performance of Heima Decoder by assessing the similarity between the recons…

**Figure 7.** Ablation study of average accuracy on 6 datasets for varying retention ratios of thinking tokens relative to original CoT.

**Figure 8.** Ablation study for number of generated tokens on MMStar with varying retention ratios of thinking tokens relative to the original CoT. Baseline (i.e., LLaVA-CoT) generates 181 tokens.

… to 90% retention ratio, accuracy fluctuates irregularly. There is no consistent pattern emerging, highlighting the unpredictable relationship between retention ratio and accuracy. Meanwhile, as shown in …
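The Figure 3 caption describes a thinking token taken from an encoder hidden state and substituted for the embedding of the special `<CoT>` token on the decoder side. A minimal sketch of that substitution follows, using plain lists in place of tensors. The function name `build_decoder_inputs`, the three-dimensional embedding table, and the hidden-state values are hypothetical, and the authors' released code may implement this step differently.

```python
COT_TOKEN = "<CoT>"

def build_decoder_inputs(tokens, embed, thinking_state):
    """Embed a token sequence, swapping in the encoder state at <CoT>.

    tokens: list of token strings for the decoder prompt.
    embed: callable mapping a token string to its embedding vector.
    thinking_state: hidden-state vector taken from the encoder.
    """
    if COT_TOKEN not in tokens:
        raise ValueError("prompt must contain the <CoT> placeholder")
    return [
        list(thinking_state) if tok == COT_TOKEN else list(embed(tok))
        for tok in tokens
    ]

# Hypothetical 3-dimensional embedding table and encoder hidden state.
table = {
    "Q:": [0.1, 0.0, 0.0],
    "why": [0.0, 0.2, 0.0],
    COT_TOKEN: [0.0, 0.0, 0.0],
}
hidden = [0.7, -0.3, 0.5]

inputs = build_decoder_inputs(["Q:", "why", COT_TOKEN], table.__getitem__, hidden)
```

The point of the design is that the decoder never sees the learned embedding of `<CoT>`; it sees the encoder's state at that position, so whatever it reconstructs has to come from that vector and the question.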

Original abstract

Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose Heima (as hidden llama), an effective CoT compression framework that condenses lengthy CoTs into a small set of abstract thinking tokens, preserving essential reasoning while removing redundancy. We then conduct a theoretical analysis from an information-theoretic perspective, quantifying the information gap induced by compression, showing that reasoning capability is preserved when non-trivial mutual information is retained. To further explore and quantify this information gap, we design the adaptive interpreter that maps thinking tokens back to variable-length textual sequences, thereby reconstructing the reasoning process. Experiments across diverse reasoning benchmarks demonstrate that Heima improves reasoning efficiency, while maintaining or even achieving better zero-shot accuracy. Moreover, the interpreter reconstructs coherent reasoning progresses from compressed thinking tokens, revealing that the information gap is minimal and validating the effectiveness of the proposed framework. This work paves the way for scalable latent reasoning models and advances our understanding of efficient reasoning processes in large models. Code: https://github.com/shawnricecake/Heima

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Heima compresses CoT into hidden tokens with benchmark accuracy holding up, but mutual information does not establish that critical reasoning steps survive.

The main point is that this paper compresses chain-of-thought into a small set of abstract thinking tokens, adds an interpreter to reconstruct text from them, and reports that zero-shot accuracy on reasoning benchmarks stays the same or improves while using fewer tokens. The authors frame the compression with an information-theoretic argument that non-trivial mutual information between the tokens and the original CoT is enough to keep reasoning intact. The coherent interpreter outputs are a concrete check they include. Code is released, which helps with verification.

What stands out as new is the explicit combination of hidden tokens, the gap analysis, and the reconstruction step, rather than token pruning alone. The experiments across multiple benchmarks give a practical sense of the efficiency gain.

The soft spot is the central claim about information preservation. Mutual information measures statistical dependence but does not identify whether the retained bits include the specific deductions or calculations needed for the answer. The paper shows the gap is small and the reconstructions look sensible, yet there are no ablations that test whether accuracy drops when logically necessary steps are removed, as opposed to incidental correlations. Without those controls, or clearer details on how the mutual information is computed and how the interpreter is trained, it remains possible that the accuracy comes from the base model rather than the compression method itself.

This work is aimed at groups focused on lowering token cost and latency in multi-step LLM inference. A reader experimenting with latent reasoning or compression techniques could try the interpreter design and check the reported numbers. It is not ready to shift practice without tighter evidence on what information is actually kept. I would send it for peer review because the experiments are testable with the public code and the idea is direct enough that referees can evaluate the information claim on its merits.

Referee Report

2 major / 1 minor

Summary. The paper introduces Heima, a CoT compression method that condenses verbose textual reasoning in MLLMs into a small set of abstract thinking tokens. It provides an information-theoretic analysis claiming that reasoning capability is preserved whenever non-trivial mutual information is retained between the compressed tokens and the original CoT. An adaptive interpreter is proposed to map tokens back to variable-length text for reconstruction. Experiments on reasoning benchmarks report maintained or improved zero-shot accuracy, with the interpreter producing coherent reconstructions that indicate a minimal information gap.

Significance. If the central claims hold, the work would contribute to more efficient latent-space reasoning in large multimodal models by reducing token overhead without sacrificing performance. The provision of code on GitHub is a positive step toward reproducibility. The information-theoretic framing and interpreter design offer a potential lens for understanding what is preserved under compression, though this depends on the validity of the MI-to-reasoning link.

major comments (2)

[theoretical analysis] Theoretical analysis section: the claim that non-trivial mutual information between compressed thinking tokens and original CoT is sufficient to preserve reasoning capability is not supported by controls or ablations showing that retained MI corresponds to task-critical logical steps rather than incidental statistical correlations. Mutual information quantifies dependence but does not identify which specific deductions survive compression.
[experiments] Experiments section: the reported maintained or improved zero-shot accuracy lacks ablations that isolate whether performance stems from retention of necessary reasoning steps versus other factors (e.g., model capacity or prompt effects). No details are given on how the mutual-information gap is measured or how the interpreter is trained without introducing circular dependence on the same data.
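The first major comment can be illustrated with a toy case in which two compressors retain the same mutual information with the CoT yet differ completely in what they keep for the answer. Everything here is an assumption for illustration: a CoT made of two independent binary steps, an answer that depends only on the first step, and the function names `keep_critical` and `keep_incidental`.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """Mutual information in bits between the two coordinates of equally likely pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(
        (k / n) * math.log2(k * n / (left[x] * right[y]))
        for (x, y), k in joint.items()
    )

# Toy CoT with two equally likely binary steps (a, b).
cots = list(product((0, 1), repeat=2))

def answer(cot):
    return cot[0]  # only step a determines the answer

def keep_critical(cot):
    return cot[0]  # compressor that keeps the task-critical step

def keep_incidental(cot):
    return cot[1]  # compressor that keeps the incidental step

mi_cot_critical = mutual_information([(keep_critical(c), c) for c in cots])
mi_cot_incidental = mutual_information([(keep_incidental(c), c) for c in cots])
mi_ans_critical = mutual_information([(keep_critical(c), answer(c)) for c in cots])
mi_ans_incidental = mutual_information([(keep_incidental(c), answer(c)) for c in cots])
```

Both compressors retain one bit about the CoT, but only `keep_critical` retains any information about the answer. An ablation of the kind the referee asks for would need to distinguish these two situations on real data.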

minor comments (1)

[abstract] Abstract: the parenthetical '(as hidden llama)' for the acronym Heima is unclear and should be expanded or removed for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, clarifying our theoretical framing and experimental design while committing to revisions where needed to strengthen the manuscript.

read point-by-point responses

Referee: [theoretical analysis] Theoretical analysis section: the claim that non-trivial mutual information between compressed thinking tokens and original CoT is sufficient to preserve reasoning capability is not supported by controls or ablations showing that retained MI corresponds to task-critical logical steps rather than incidental statistical correlations. Mutual information quantifies dependence but does not identify which specific deductions survive compression.

Authors: We agree that mutual information measures statistical dependence without isolating specific logical deductions. Our analysis uses MI as a necessary (not sufficient) condition for preservation: if non-trivial MI is retained, the compressed tokens contain enough information for the downstream model to achieve the observed reasoning performance. This is supported by the consistent zero-shot accuracy across benchmarks rather than by direct identification of individual steps. We will revise the theoretical section to explicitly discuss this distinction and the limitations of MI as a proxy, and we will add a targeted discussion of why incidental correlations alone are unlikely to explain the maintained performance given the compression ratios achieved.

revision: partial

Referee: [experiments] Experiments section: the reported maintained or improved zero-shot accuracy lacks ablations that isolate whether performance stems from retention of necessary reasoning steps versus other factors (e.g., model capacity or prompt effects). No details are given on how the mutual-information gap is measured or how the interpreter is trained without introducing circular dependence on the same data.

Authors: We will add ablations in the revised experiments section that control for prompt effects and model capacity (e.g., fixed-prompt baselines and capacity-matched uncompressed runs) to better isolate the contribution of retained reasoning information. The MI gap is quantified through the adaptive interpreter's reconstruction fidelity and coherence metrics on held-out examples. The interpreter is trained on a separate corpus distinct from the evaluation benchmarks to avoid circular dependence; we will include these training details, dataset splits, and hyperparameter choices in the updated manuscript.

revision: yes

Circularity Check

1 step flagged

Reasoning preservation claim reduces to mutual information retention by definition

specific steps

self-definitional [Abstract]
"We then conduct a theoretical analysis from an information-theoretic perspective, quantifying the information gap induced by compression, showing that reasoning capability is preserved when non-trivial mutual information is retained."

The analysis defines the information gap as the reduction in mutual information between compressed thinking tokens and the original CoT; the preservation result is then stated as holding whenever this mutual information remains non-trivial. The claimed theoretical result is therefore equivalent to the definition of retained mutual information rather than derived from separate principles about reasoning structure.

full rationale

The paper's theoretical analysis quantifies the information gap via mutual information and then asserts preservation of reasoning capability exactly when non-trivial MI is retained. This link is definitional: the retained information is defined as the MI, so the 'showing' that capability is preserved follows immediately from the quantification rather than from an independent argument about which reasoning steps survive. Experiments demonstrate maintained accuracy but do not supply an independent derivation of the preservation claim. No self-citations or other load-bearing reductions appear in the provided text.
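The circularity concern can be written out with standard identities. The notation is an assumption introduced here, not taken from the paper: $Z$ for the thinking tokens, $C$ for the original CoT, $Y$ for the answer, $\Delta$ for the information gap, and $P_e$ for the error probability of any predictor that sees only $Z$ (in practice the inputs would be added to the conditioning).

```latex
\begin{align}
I(Z; C) &= H(C) - H(C \mid Z), \\
\Delta &:= H(C) - I(Z; C) = H(C \mid Z) \;\ge\; 0, \\
H(Y \mid Z) &\le h_b(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr)
  \quad \text{(Fano's inequality)}.
\end{align}
```

If the gap is defined as in the second line, then "the gap is small" and "mutual information is retained" are the same statement. The third line indicates what an independent argument would have to control: a small error probability requires $H(Y \mid Z)$ to be small, which is a condition on the answer rather than on $I(Z; C)$.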

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on an unstated assumption that mutual information is a sufficient proxy for reasoning fidelity.

pith-pipeline@v0.9.0 · 5754 in / 1058 out tokens · 60403 ms · 2026-05-23T04:43:11.219431+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI · 2026-05 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI · 2026-04 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Human Cognition in Machines: A Unified Perspective of World Models
cs.RO · 2026-04 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
cs.CL · 2025-03 · accept · novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

A Survey of Scaling in Large Language Model Reasoning
cs.AI · 2025-04 · unverdicted · novelty 3.0

A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 5 Pith papers · 17 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A versatile vision- language model for understanding, localization, text read- ing, and beyond. arXiv preprint arXiv:2308.12966,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Hopping too late: Exploring the limitations of large language models on multi-hop queries

Biran, E., Gottesman, D., Yang, S., Geva, M., and Glober- son, A. Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775,

work page arXiv
[4]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y ., Chen, Z., Duan, H., Wang, J., Qiao, Y ., Lin, D., et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Cheng, J. and Van Durme, B. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Adapting language models to compress contexts

Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788,

work page arXiv
[7]

S., Bansal, M., and Chen, C

Deng, A., Chen, T., Yu, S., Yang, T., Spencer, L., Tian, Y ., Mian, A. S., Bansal, M., and Chen, C. Motion-grounded video reasoning: Understanding and perceiving motion at pixel level. arXiv preprint arXiv:2411.09921, 2024a. Deng, Y ., Choi, Y ., and Shieber, S. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint ar...

work page arXiv
[8]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and Goodman, N. D. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683,

work page arXiv
[10]

Training Large Language Models to Reason in a Continuous Latent Space

Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

As- sociation for Computational Linguistics. URL https: //aclanthology.org/2024.acl-long.91. 9 Efficient Reasoning with Hidden Thinking Khot, T., Trivedi, H., Finlayson, M., Fu, Y ., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Cllms: Consistency large language models

Kou, S., Hu, L., He, Z., Deng, Z., and Zhang, H. Cllms: Consistency large language models. arXiv preprint arXiv:2403.00835,

work page arXiv
[14]

Eagle-2: Faster inference of language models with dynamic draft trees

Li, Y ., Wei, F., Zhang, C., and Zhang, H. Eagle-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858,

work page arXiv
[15]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek- v3 technical report. arXiv preprint arXiv:2412.19437 , 2024a. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tun- ing. Advances in neural information processing systems , 36, 2024b. Liu, Y ., Li, H., Du, K., Yao, J., Cheng, Y ., Hua...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The expresssive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923,

work page arXiv
[17]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Meta. Introducing llama 3.1: Our most capable models to date. blog, 2024a. URL https://ai.meta.com/ blog/meta-llama-3-1/ . Meta. Llama 3.2: Revolutionizing edge ai and vi- sion with open, customizable models. blog, 2024b. URL https://ai.meta.com/blog/ llama-3-2-connect-2024-vision-edge-mobile-devices/ . Meta. torchtune: Pytorch’s finetuning library. soft-...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Detgpt: Detect what you need via reasoning

Pi, R., Gao, J., Diao, S., Pan, R., Dong, H., Zhang, J., Yao, L., Han, J., Xu, H., Kong, L., et al. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167,

work page arXiv
[19]

C., Rao, N., and Van Durme, B

Qin, G., Rosset, C., Chau, E. C., Rao, N., and Van Durme, B. Nugget 2d: Dynamic contextual compression for scaling decoder-only language models. arXiv preprint arXiv:2310.02409,

work page arXiv
[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces.arXiv preprint arXiv:2410.09918,

Su, D., Sukhbaatar, S., Rabbat, M., Tian, Y ., and Zheng, Q. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces.arXiv preprint arXiv:2410.09918,

work page arXiv
[22]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Wang, P., Li, L., Shao, Z., Xu, R., Dai, D., Li, Y ., Chen, D., Wu, Y ., and Sui, Z. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Xu, G., Jin, P., Hao, L., Song, Y ., Sun, L., and Yuan, L. Llava-o1: Let vision language models reason step-by- step. arXiv preprint arXiv:2411.10440,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Visa: Reasoning video object segmentation via large language models

Yan, C., Wang, H., Yan, S., Jiang, X., Hu, Y ., Kang, G., Xie, W., and Gavves, E. Visa: Reasoning video object segmentation via large language models. arXiv preprint arXiv:2407.11325,

work page arXiv
[25]

Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837,

Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837,

work page arXiv
[26]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y ., Kwok, J. T., Li, Z., Weller, A., and Liu, W. Metamath: Boot- strap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Yue, X., Qu, X., Zhang, G., Fu, Y ., Huang, W., Sun, H., Su, Y ., and Chen, W. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Fast chain-of-thought: A glance of future from parallel decoding leads to answers faster

Zhang, H., Liu, Z., Zhao, Y ., Zheng, J., Zhuang, C., Gu, J., and Chen, G. Fast chain-of-thought: A glance of future from parallel decoding leads to answers faster. arXiv preprint arXiv:2311.08263,

work page arXiv
[29]

BERTScore: Evaluating Text Generation with BERT

Zhang, T., Kishore, V ., Wu, F., Weinberger, K. Q., and Artzi, Y . Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[30]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Zhou, D., Sch ¨arli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A versatile vision- language model for understanding, localization, text read- ing, and beyond. arXiv preprint arXiv:2308.12966,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Hopping too late: Exploring the limitations of large language models on multi-hop queries

Biran, E., Gottesman, D., Yang, S., Geva, M., and Glober- son, A. Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775,

work page arXiv

[4] [4]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y ., Chen, Z., Duan, H., Wang, J., Qiao, Y ., Lin, D., et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Cheng, J. and Van Durme, B. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Adapting language models to compress contexts

Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788,

work page arXiv

[7] [7]

S., Bansal, M., and Chen, C

Deng, A., Chen, T., Yu, S., Yang, T., Spencer, L., Tian, Y ., Mian, A. S., Bansal, M., and Chen, C. Motion-grounded video reasoning: Understanding and perceiving motion at pixel level. arXiv preprint arXiv:2411.09921, 2024a. Deng, Y ., Choi, Y ., and Shieber, S. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint ar...

work page arXiv

[8] [8]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and Goodman, N. D. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683,

work page arXiv

[10] [10]

Training Large Language Models to Reason in a Continuous Latent Space

Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

As- sociation for Computational Linguistics. URL https: //aclanthology.org/2024.acl-long.91. 9 Efficient Reasoning with Hidden Thinking Khot, T., Trivedi, H., Finlayson, M., Fu, Y ., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Cllms: Consistency large language models

Kou, S., Hu, L., He, Z., Deng, Z., and Zhang, H. Cllms: Consistency large language models. arXiv preprint arXiv:2403.00835,

work page arXiv

[14] [14]

Eagle-2: Faster inference of language models with dynamic draft trees

Li, Y ., Wei, F., Zhang, C., and Zhang, H. Eagle-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858,

work page arXiv

[15] [15]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek- v3 technical report. arXiv preprint arXiv:2412.19437 , 2024a. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tun- ing. Advances in neural information processing systems , 36, 2024b. Liu, Y ., Li, H., Du, K., Yao, J., Cheng, Y ., Hua...

[16]

The Expressive Power of Transformers with Chain of Thought

Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923,

[17]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Meta. Introducing llama 3.1: Our most capable models to date. blog, 2024a. URL https://ai.meta.com/blog/meta-llama-3-1/. Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. blog, 2024b. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Meta. torchtune: Pytorch’s finetuning library. soft-...

[18]

Detgpt: Detect what you need via reasoning

Pi, R., Gao, J., Diao, S., Pan, R., Dong, H., Zhang, J., Yao, L., Han, J., Xu, H., Kong, L., et al. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167,

[19]

Nugget 2d: Dynamic contextual compression for scaling decoder-only language models

Qin, G., Rosset, C., Chau, E. C., Rao, N., and Van Durme, B. Nugget 2d: Dynamic contextual compression for scaling decoder-only language models. arXiv preprint arXiv:2310.02409,

[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[21]

Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces

Su, D., Sukhbaatar, S., Rabbat, M., Tian, Y., and Zheng, Q. Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces. arXiv preprint arXiv:2410.09918,

[22]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Wang, P., Li, L., Shao, Z., Xu, R., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935,

[23]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Xu, G., Jin, P., Hao, L., Song, Y., Sun, L., and Yuan, L. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440,

[24]

Visa: Reasoning video object segmentation via large language models

Yan, C., Wang, H., Yan, S., Jiang, X., Hu, Y., Kang, G., Xie, W., and Gavves, E. Visa: Reasoning video object segmentation via large language models. arXiv preprint arXiv:2407.11325,

[25]

Do large language models latently perform multi-hop reasoning?

Yang, S., Gribovskaya, E., Kassner, N., Geva, M., and Riedel, S. Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837,

[26]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A., and Liu, W. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284,

[27]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Yue, X., Qu, X., Zhang, G., Fu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653,

[28]

Fast chain-of-thought: A glance of future from parallel decoding leads to answers faster

Zhang, H., Liu, Z., Zhao, Y., Zheng, J., Zhuang, C., Gu, J., and Chen, G. Fast chain-of-thought: A glance of future from parallel decoding leads to answers faster. arXiv preprint arXiv:2311.08263,

[29]

BERTScore: Evaluating Text Generation with BERT

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675,

[30]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625,
