Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Chengzhengxu Li; Daiting Shi; Guoxin Ma; Kecheng Chen; Yan Wang; Yibing Liu; Yueyang Zhang; Yu Liang; Zhaohan Zhang; Zhiyuan Sun

arxiv: 2605.28713 · v1 · pith:QPKWXJ4Enew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords: context compression · long-context QA · thinking traces · LLM reasoning · reward-driven optimization · inference acceleration · question answering

The pith

An LLM's own reasoning traces can compress long contexts more effectively than dedicated compression methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that prompting a reasoning model to generate thinking traces about a task turns those traces into a compressed version of the long input context. This approach, called Thinking as Compression, avoids the need for separate compression modules or specialized training. A constrained version uses reward optimization to ensure the traces are short and focused, leading to better results on question-answering tasks with long documents. Readers should care because it implies that existing models already possess an internal ability to summarize and retain key information through reasoning, potentially simplifying long-context handling.

Core claim

We reveal that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We derive Thinking as Compression (TaC) that directly prompts the thinking model to generate thinking traces as the shortened context. Further, TaC-C leverages a reward-driven optimization framework to elicit compact and controllable compressed context, outperforming baselines on long-context QA benchmarks at 4x and 8x ratios.

What carries the argument

Thinking traces generated by direct prompting of the model, refined via reward-driven optimization to enforce length control and retain task information.
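The prompt-then-answer flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Generate`, `build_thinker_prompt`, `tac_compress`, `tac_answer`, and the prompt wording are all hypothetical names and text introduced here, and the two models are represented as plain callables that map a prompt string to generated text.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to generated text.
Generate = Callable[[str], str]


def build_thinker_prompt(question: str, context: str, budget_tokens: int) -> str:
    """Assumed prompt wording; the paper's actual prompt is not shown on this page."""
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Think through the information needed to answer, "
        f"using at most {budget_tokens} tokens."
    )


def tac_compress(question: str, context: str, budget_tokens: int,
                 thinker: Generate) -> str:
    """Return the thinker's trace, which stands in for the long context."""
    prompt = build_thinker_prompt(question, context, budget_tokens)
    return thinker(prompt).strip()


def tac_answer(question: str, context: str, budget_tokens: int,
               thinker: Generate, answer_model: Generate) -> str:
    """Answer from the trace only; the original context is never passed on."""
    trace = tac_compress(question, context, budget_tokens, thinker)
    answer_prompt = f"Context:\n{trace}\n\nQuestion: {question}\nAnswer:"
    return answer_model(answer_prompt)
```

The design point the sketch tries to capture is that the answer model sees only the trace, so any saving in input length comes from the trace being shorter than the original context.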

If this is right

TaC-C outperforms the strongest baseline by 17.4% F1 and 15.7% EM at 4x compression.
At 8x compression the gains rise to 23.4% F1 and 21.7% EM.
Thinking traces require no separate compression module or task-specific training.
Reward optimization controls length while retaining relevant information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests reasoning models may have an innate compression capability that could be unlocked for other long-input tasks.
Future systems might integrate thinking-based compression to reduce inference costs without additional modules.
Testing on tasks beyond QA could reveal if the approach generalizes or relies on question-specific structures.

Load-bearing premise

Optimizing the thinking process with rewards will produce traces that preserve all task-relevant details without hidden shortcuts that only succeed on the specific test questions used.

What would settle it

If TaC-C is evaluated on a held-out long-context QA dataset with questions crafted to expose potential shortcuts, and its scores fall below the strongest baseline, the claim of superior compression would be falsified.

Figures

Figures reproduced from arXiv: 2605.28713 by Chengzhengxu Li, Daiting Shi, Guoxin Ma, Kecheng Chen, Yan Wang, Yibing Liu, Yueyang Zhang, Yu Liang, Zhaohan Zhang, Zhiyuan Sun.

**Figure 2.** Pilot study of Thinking as Compression. Token retention ratio (compressed length / original length) measures the degree of compression. To examine whether raw thinking traces can serve as compressed contexts, we instantiate a prompt-only variant of Thinking as Compression, denoted as TaC-Vanilla. Given (q, C) and a target budget, TaC-Vanilla prompts Qwen3-8B to generate a thinking trace o, which serves …
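The token retention ratio in the Figure 2 caption is a simple quotient, and it can be related to the 4x and 8x settings quoted elsewhere on this page. The sketch below assumes that an "Nx" compression ratio means original length divided by compressed length (the reciprocal of the retention ratio); the function names are hypothetical and token counts are taken as given, since the tokenizer used for counting is not stated here.

```python
def token_retention_ratio(compressed_tokens: int, original_tokens: int) -> float:
    """Compressed length divided by original length, as in the Figure 2 caption."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return compressed_tokens / original_tokens


def compression_ratio(compressed_tokens: int, original_tokens: int) -> float:
    """Assumed reading of 'Nx compression': original length over compressed length."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def budget_for_ratio(original_tokens: int, ratio: float) -> int:
    """Token budget that would achieve the given compression ratio (at least 1)."""
    return max(1, int(original_tokens // ratio))
```

Under this reading, an 8,000-token context compressed at 4x leaves a 2,000-token budget, which is a retention ratio of 0.25.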

**Figure 3.** TaC-C framework. Given a long context C, query q, and budget B, the Thinker generates …

**Figure 4.** Exact Match (EM) scores on four datasets across different model scales. To examine how the scale of the Thinker Model affects context compression, we fix Qwen3-8B as the Answer Model and train TaC-C with Qwen3-family Thinkers of different scales, from 1.7B to 14B …

**Figure 5.** Training curves of the three reward components in TaC-C: utility reward, budget control …

Original abstract

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core claim is that reasoning traces can double as context compression with reward tuning delivering 17-23% gains, but the abstract leaves the reward setup and baseline fairness too vague to trust the results without the full methods.

The main point is that this work treats a model's own thinking traces as the compressed context instead of training separate compressor modules. TaC starts with plain prompting of thinking output, then TaC-C adds reward-driven optimization to keep traces short and on-task. On four long-context QA benchmarks it reports beating the strongest prior method by 17.4% and 23.4% F1 at 4x and 8x compression, with similar lifts in exact match.

That reuse of existing reasoning models without new architectures or training regimes is the clearest novelty. The constrained optimization step to handle length and shortcuts is a practical addition that the raw thinking version apparently needs.

The soft spot is exactly the one the stress-test flags. The abstract gives no description of how the reward is computed, whether optimization runs on train or test splits, or whether the baselines received equivalent tuning. If the reward directly incorporates F1 or EM on the evaluation data, the large deltas could reflect benchmark-specific shortcuts rather than genuine information-preserving compression. Without those controls the performance numbers are hard to interpret.

The citation pattern looks standard for the area, but the lack of statistical details or prompt-sensitivity checks in the abstract keeps the soundness low. The central argument holds up only if the full paper shows the reward does not create circular or overfit behavior.

This is aimed at people working on long-context serving and inference efficiency. A reader already thinking about reward models or test-time optimization would get the most out of it. The work deserves a serious referee to check the experimental details, even though the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLMs' intrinsic thinking traces can serve as compressed context (TaC paradigm) without dedicated modules, and introduces TaC-C, a reward-driven optimization to produce compact, controllable traces. Experiments on four long-context QA benchmarks show TaC-C outperforming baselines, with gains of 17.4% and 23.4% in average F1 (and 15.7%, 21.7% in EM) at 4x and 8x compression ratios over the strongest competitor.
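For readers unfamiliar with the two metrics quoted in the summary, the sketch below shows the token-level F1 and Exact Match computation commonly used for extractive QA (SQuAD-style normalization). It is an assumption that the paper's evaluation follows this convention; the page does not show the actual scoring script, and the function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the two reported gains differ.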

Significance. If the reported gains prove robust to controls for optimization procedure and evaluation, the result would be significant for showing that reasoning models have built-in compression ability, crediting the direct prompting of thinking traces and the constrained optimization framework as a lightweight alternative to complex compressor training. This could simplify context handling in LLM inference.

major comments (2)

[Abstract] The headline performance deltas (17.4%/23.4% F1 at 4x/8x) rest on TaC-C's reward-driven optimization, yet no information is supplied on reward formulation, whether optimization uses train vs. test splits of the four QA benchmarks, or equivalent optimization applied to baselines. This is load-bearing for the central claim that the traces represent general compression rather than benchmark-specific shortcuts.
[§3, TaC-C framework] The description of the reward-driven optimization does not specify how the reward is derived from task metrics (F1/EM) or whether it incorporates length penalties and information-preservation terms in a way that prevents encoding of answers or dataset artifacts. Without this, the outperformance cannot be distinguished from optimization for evaluation artifacts.

minor comments (1)

[Abstract] The abstract and early sections use 'thinking model' and 'reasoning model' interchangeably without a precise definition or citation to the specific model family used in experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the TaC-C optimization details. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

Referee: [Abstract] The headline performance deltas (17.4%/23.4% F1 at 4x/8x) rest on TaC-C's reward-driven optimization, yet no information is supplied on reward formulation, whether optimization uses train vs. test splits of the four QA benchmarks, or equivalent optimization applied to baselines. This is load-bearing for the central claim that the traces represent general compression rather than benchmark-specific shortcuts.

Authors: We agree that the current manuscript does not supply sufficient detail on these points. In the revised version we will expand both the abstract and §3 to specify the reward formulation, confirm that optimization was performed only on the training splits of the four benchmarks (with validation used for hyperparameter selection), and report that we applied an analogous reward-driven procedure to the strongest baseline for direct comparison. These additions will make explicit that the reported gains are not attributable to test-set leakage or uncontrolled optimization differences.

revision: yes

Referee: [§3, TaC-C framework] The description of the reward-driven optimization does not specify how the reward is derived from task metrics (F1/EM) or whether it incorporates length penalties and information-preservation terms in a way that prevents encoding of answers or dataset artifacts. Without this, the outperformance cannot be distinguished from optimization for evaluation artifacts.

Authors: We acknowledge that §3 currently lacks an explicit mathematical description of the reward. The reward combines the downstream F1/EM score with a length penalty and an auxiliary term that penalizes direct leakage of answer spans (detected via string matching against the gold answer). We will revise §3 to present the full reward equation and explain how the penalty terms discourage answer encoding and dataset-specific shortcuts. This change will allow readers to evaluate whether the gains reflect general compression rather than artifact exploitation.

revision: yes
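The reward shape described in the simulated rebuttal (task score, minus a length penalty, minus a penalty for leaking the gold answer) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the function name `tac_c_reward`, the weights, the whitespace token count, and the linear overflow penalty are hypothetical and are not taken from the paper, whose actual reward equation is not shown on this page.

```python
def tac_c_reward(task_score: float, trace: str, gold_answer: str,
                 budget_tokens: int, length_weight: float = 0.5,
                 leak_weight: float = 1.0) -> float:
    """Illustrative reward: task score minus length and answer-leak penalties.

    task_score is a downstream metric such as F1 or EM in [0, 1].
    The weights and penalty shapes are hypothetical placeholders.
    """
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")

    # Whitespace split is a stand-in for the model tokenizer (assumption).
    n_tokens = len(trace.split())

    # Penalize only the portion of the trace that exceeds the budget,
    # as a fraction of the budget, capped at 1.0.
    overflow = max(0, n_tokens - budget_tokens) / budget_tokens
    length_penalty = length_weight * min(1.0, overflow)

    # String-matching leak check: does the trace contain the gold answer verbatim?
    answer = gold_answer.strip().lower()
    leaked = bool(answer) and answer in trace.lower()
    leak_penalty = leak_weight if leaked else 0.0

    return task_score - length_penalty - leak_penalty
```

One caveat on the leak term as sketched: a plain substring check would also penalize a trace that legitimately quotes the supporting passage containing the answer, which bears on the referee's question about whether the reward separates compression from shortcuts.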

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent benchmarks

The paper introduces TaC as a conceptual reframing (thinking traces as compressed context) and TaC-C via reward optimization, then reports empirical gains on four long-context QA benchmarks. No equations, derivations, or self-citations are shown that reduce the reported F1/EM improvements to the optimization inputs by construction. The performance numbers are presented as held-out experimental comparisons rather than tautological outputs of the fitting process itself. This is the common case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical observation that thinking traces preserve task information and that a simple reward objective can control their length and quality. No explicit free parameters, axioms, or invented entities are named in the abstract beyond the standard assumption that LLMs can follow prompts and that benchmark scores reflect real capability.

axioms (2)

domain assumption LLMs can follow prompts to produce task-relevant thinking traces
Implicit in the TaC prompting step described in the abstract.
domain assumption Reward-driven optimization can improve compactness without harming downstream accuracy
Central to the TaC-C variant.

pith-pipeline@v0.9.1-grok · 5776 in / 1205 out tokens · 36156 ms · 2026-06-29T12:07:11.781507+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 20 canonical work pages · 11 internal anchors

[1] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
[2] N. Chirkova, T. Formal, V. Nikoulina, and S. Clinchant. Provence: efficient and robust context pruning for retrieval-augmented generation. 2025.
[3] Y. Dai, J. Lian, Y. Huang, W. Zhang, M. Zhou, M. Wu, X. Xie, and H. Liao. Pretraining context compressor for large language models with embedding-based memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28715–28732, 2025.
[4] A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Dohan, F. Song, H. Lightman, I. Clavera, J. Pachocki, et al. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025.
[5] Y. Feldman and Y. Artzi. Simple context compression: Mean-pooling and multi-ratio training. arXiv preprint arXiv:2510.20797, 2025.
[6] L. Feng, Z. Xue, T. Liu, and B. An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025.
[7] T. Ge, J. Hu, L. Wang, X. Wang, S.-Q. Chen, and F. Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023.
[8] Google DeepMind. Gemma 4 26b a4b model card. https://ai.google.dev/gemma/docs/core/model_card_4, 2026. Accessed: 2026-05-04.
[9] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[10] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[11] R. Guo, Y. Liu, G. Ma, Y. Wang, Y. Zhang, L. Xia, K. Chen, Z. Sun, and D. Shi. When less is more: The llm scaling paradox in context compression. 2026.
[12] X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020.
[13] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[14] T. Hwang, S. Cho, S. Jeong, H. Song, S. Y. Han, and J. C. Park. Exit: Context-aware extractive compression for enhancing retrieval-augmented generation. 2024.
[15] H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024.
[16] Y. Jiang and M. Bansal. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop qa. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2726–2736, 2019.
[17] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
[18] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
[19] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[20] D. Li, S. Cao, T. Griggs, S. Liu, X. Mo, E. Tang, S. Hegde, K. Hakhamaneshi, S. G. Patil, and M. Zaharia. Llms can easily learn to reason from demonstrations structure, not content, is what matters! 2025.
[21] T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024.
[22] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025.
[23] Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
[24] Z. Li, C. Huang, X. Wang, H. Hu, C. Wyeth, D. Bu, Q. Yu, W. Gao, X. Liu, and M. Li. Understanding is compression. arXiv preprint arXiv:2407.07723, 2024.
[25] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[26] T. Lin, J. Xie, S. Yuan, and D. Yang. Implicit reasoning in transformers is reasoning through shortcuts. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9470–9487, 2025.
[27] A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.
[28] Z. Pan, Q. Wu, H. Jiang, M. Xia, X. Luo, J. Zhang, Q. Lin, V. Rühle, Y. Yang, C.-Y. Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981, 2024.
[29] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[30] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
[31] J. Tang, S. Liu, Z. Zhang, Q. Lv, R. Zhao, T. Lu, L. Liu, H. Chen, Y. Yuan, H.-T. Zheng, et al. Read as human: Compressing context via parallelizable close reading and skimming. arXiv preprint arXiv:2602.01840, 2026.
[32] J. Tang, S. Liu, Z. Zhang, Y. Yuan, L. Zheng, W. Su, and B. Zheng. Comi: Coarse-to-fine context compression via marginal information gain. arXiv preprint arXiv:2602.01719, 2026.
[33] J. Tang, Z. Zhang, S. Wu, J. Ye, L. Bai, Z. Wang, T. Lu, L. Hai, Y. Zhao, H.-T. Zheng, et al. Gmsa: Enhancing context compression via group merging and layer semantic alignment. arXiv preprint arXiv:2505.12215, 2025.
[34] J. F. Ton, M. F. Taufiq, and Y. Liu. Understanding chain-of-thought in llms through information theory. 2024.
[35] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
[36] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[37] Z. Wang, C. Li, Z. Yang, Q. Liu, Y. Hao, X. Chen, D. Chu, and D. Sui. Analyzing chain-of-thought prompting in black-box large language models via estimated v-information. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 893–903, 2024.
[38] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[39] F. Xu, W. Shi, and E. Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408, 2023.
[40] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, and C. Lv. Qwen3 technical report. 2025.
[41] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.
[42] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[43] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[44] P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou. Long context compression with activation beacon. arXiv preprint arXiv:2401.03462, 2024.
[45] C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
[46] K. Zhu, X. Feng, X. Du, Y. Gu, W. Yu, H. Wang, Q. Chen, Z. Chu, J. Chen, and B. Qin. An information bottleneck perspective for effective noise filtering on retrieval-augmented generation. 2024.

A Implementation Details. All reinforcement learning experiments are implemented with VeRL [30] using a pure GRPO objective [29, 10], without training a separat...
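The implementation note above mentions a GRPO objective [29, 10]. The core idea of GRPO, as commonly described, is to score each sampled output relative to the other outputs drawn for the same prompt instead of using a learned value model. The sketch below shows only that group-relative advantage step in plain Python; it is not VeRL code, and the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps (a hypothetical value) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the reward components need to discriminate between traces.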

[1] [1]

Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

2024

[2] [2]

Chirkova, T

N. Chirkova, T. Formal, V . Nikoulina, and S. Clinchant. Provence: efficient and robust context pruning for retrieval-augmented generation. 2025

2025

[3] [3]

Y . Dai, J. Lian, Y . Huang, W. Zhang, M. Zhou, M. Wu, X. Xie, and H. Liao. Pretraining context compressor for large language models with embedding-based memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages 28715–28732, 2025

2025

[4] [4]

El-Kishky, A

A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Dohan, F. Song, H. Lightman, I. Clavera, J. Pachocki, et al. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025

work page arXiv 2025

[5] [5]

No Mean Feat: Simple, Strong Baselines for Context Compression

Y . Feldman and Y . Artzi. Simple context compression: Mean-pooling and multi-ratio training. arXiv preprint arXiv:2510.20797, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

L. Feng, Z. Xue, T. Liu, and B. An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

T. Ge, J. Hu, L. Wang, X. Wang, S.-Q. Chen, and F. Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945, 2023

work page arXiv 2023

[8] [8]

Gemma 4 26b a4b model card

Google DeepMind. Gemma 4 26b a4b model card. https://ai.google.dev/gemma/docs/ core/model_card_4, 2026. Accessed: 2026-05-04

2026

[9] [9]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

R. Guo, Y . Liu, G. Ma, Y . Wang, Y . Zhang, L. Xia, K. Chen, Z. Sun, and D. Shi. When less is more: The llm scaling paradox in context compression. 2026

2026

[12] [12]

Ho, A.-K

X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

2020

[13] [13]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

2022

[14] [14]

Hwang, S

T. Hwang, S. Cho, S. Jeong, H. Song, S. Y . Han, and J. C. Park. Exit: Context-aware extractive compression for enhancing retrieval-augmented generation. 2024

2024

[15] [15]

Jiang, Q

H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y . Lin, Y . Yang, and L. Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages 1658–1677, 2024

2024

[16] [16]

Jiang and M

Y . Jiang and M. Bansal. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop qa. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2726–2736, 2019

2019

[17] [17]

B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Kwiatkowski, J

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019

[19] [19]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023

[20] [20]

D. Li, S. Cao, T. Griggs, S. Liu, X. Mo, E. Tang, S. Hegde, K. Hakhamaneshi, S. G. Patil, and M. Zaharia. Llms can easily learn to reason from demonstrations structure, not content, is what matters! 2025

2025

[21] [21]

T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024

work page arXiv 2024

[22] [22]

X. Li, G. Dong, J. Jin, Y . Zhang, Y . Zhou, Y . Zhu, P. Zhang, and Z. Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, 2025

2025

[23] [23]

Y . Li, Y . Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

2024

[24] [24]

Z. Li, C. Huang, X. Wang, H. Hu, C. Wyeth, D. Bu, Q. Yu, W. Gao, X. Liu, and M. Li. Understanding is compression. arXiv preprint arXiv:2407.07723, 2024

work page arXiv 2024

[25] [25]

Lightman, V

H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

2023

[26] [26]

T. Lin, J. Xie, S. Yuan, and D. Yang. Implicit reasoning in transformers is reasoning through shortcuts. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9470–9487, 2025

[27]

A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

[28]

Z. Pan, Q. Wu, H. Jiang, M. Xia, X. Luo, J. Zhang, Q. Lin, V. Rühle, Y. Yang, C.-Y. Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981, 2024

[29]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[30]

G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

[31]

J. Tang, S. Liu, Z. Zhang, Q. Lv, R. Zhao, T. Lu, L. Liu, H. Chen, Y. Yuan, H.-T. Zheng, et al. Read as human: Compressing context via parallelizable close reading and skimming. arXiv preprint arXiv:2602.01840, 2026

[32]

J. Tang, S. Liu, Z. Zhang, Y. Yuan, L. Zheng, W. Su, and B. Zheng. Comi: Coarse-to-fine context compression via marginal information gain. arXiv preprint arXiv:2602.01719, 2026

[33]

J. Tang, Z. Zhang, S. Wu, J. Ye, L. Bai, Z. Wang, T. Lu, L. Hai, Y. Zhao, H.-T. Zheng, et al. Gmsa: Enhancing context compression via group merging and layer semantic alignment. arXiv preprint arXiv:2505.12215, 2025

[34]

J. F. Ton, M. F. Taufiq, and Y. Liu. Understanding chain-of-thought in llms through information theory. 2024

[35]

H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

[36]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[37]

Z. Wang, C. Li, Z. Yang, Q. Liu, Y. Hao, X. Chen, D. Chu, and D. Sui. Analyzing chain-of-thought prompting in black-box large language models via estimated v-information. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 893–903, 2024

[38]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[39]

F. Xu, W. Shi, and E. Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408, 2023

[40]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, and C. Lv. Qwen3 technical report. 2025

[41]

Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

[42]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[43]

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[44]

P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou. Long context compression with activation beacon. arXiv preprint arXiv:2401.03462, 2024

[45]

C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

[46]

K. Zhu, X. Feng, X. Du, Y. Gu, W. Yu, H. Wang, Q. Chen, Z. Chu, J. Chen, and B. Qin. An information bottleneck perspective for effective noise filtering on retrieval-augmented generation. 2024

A Implementation Details

All reinforcement learning experiments are implemented with VeRL [30] using a pure GRPO objective [29, 10], without training a separat...
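As a rough illustration of what a GRPO objective works with, the sketch below computes group-relative advantages in the style described in [29]: several responses are sampled for one prompt, and each response's reward is normalized against the mean and standard deviation of its own group rather than against a learned value estimate. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration; they are not taken from the paper's implementation or from VeRL.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed constant) avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four traces sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one, so the signal is relative within the group and does not depend on the absolute reward scale.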
