
arxiv: 2606.13316 · v1 · pith:53MEIT2Anew · submitted 2026-06-11 · 💻 cs.AI

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

Pith reviewed 2026-06-27 06:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM reasoning · reinforcement learning · self-summarization · RLVR · adaptive rollout · contrastive evaluation · reasoning trajectory

The pith

ReSum trains LLMs to insert self-summaries that shorten reasoning rollouts while improving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard RLVR training for long-horizon LLM reasoning tends to produce overly long trajectories that consume context and lose coherence. It introduces a mechanism allowing the model to decide on its own when to compress parts of the trajectory through self-summarization, with a contrastive rollout process that checks whether the compression actually helps the final answer. Pilot observations indicate that summaries lower token entropy and limit error spread from flawed prefixes. When integrated into the full training loop, the approach yields shorter outputs and higher success rates on reasoning tasks.

Core claim

ReSum is an RLVR framework in which the model learns to trigger self-summarization during generation. When the model spontaneously emits the summarization phrase, the method masks it to form a contrastive branch and compares the outcome against the original trajectory; at positions without the phrase, it randomly injects the phrase to form matched branches. A summarization-aware advantage then scores the pairs at finer granularity, so the policy learns both to reason correctly and to organize its own trajectory by compression when useful.
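The branch construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the trigger string `SUMMARY_PHRASE`, the function `build_branch_prefixes`, the step-level granularity, and the `inject_prob` parameter are all hypothetical, since this review does not state the actual phrase or sampling rule. The sketch only builds the prefixes from which contrastive continuations would be sampled.

```python
import random

# Hypothetical trigger phrase; the paper's actual phrase is not given in this review.
SUMMARY_PHRASE = "Let me summarize."


def build_branch_prefixes(steps, phrase=SUMMARY_PHRASE, inject_prob=0.5, rng=None):
    """Return (kind, position, prefix) tuples for contrastive branches.

    steps: list of reasoning-step strings from one initial rollout.
    kind is "masked" when a spontaneous phrase was cut out of the prefix,
    and "injected" when the phrase was appended at a phrase-free position.
    """
    rng = rng or random.Random(0)
    branches = []
    for i, step in enumerate(steps):
        if phrase in step:
            # Spontaneous trigger: keep only the text before the phrase,
            # so the continuation is generated without the summary.
            before = step.split(phrase, 1)[0].strip()
            prefix = steps[:i] + ([before] if before else [])
            branches.append(("masked", i, prefix))
        elif rng.random() < inject_prob:
            # No trigger here: append the phrase so the continuation
            # starts with a summary (assumed uniform random injection).
            prefix = steps[: i + 1] + [phrase]
            branches.append(("injected", i, prefix))
    return branches
```

Each returned prefix would be handed back to the policy for continuation, and the outcome compared against the original rollout at the same position.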

What carries the argument

The summarization-aware adaptive rollout mechanism, which creates contrastive masked and injected branches around self-summarization points and scores them with a dedicated advantage function.
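The exact form of the advantage function is not given in this review. The Figure 2 caption states only that the advantage is calculated within branches with and without the summarization, respectively. The sketch below is one possible reading under that statement: a group-relative (mean and standard deviation) normalization applied separately to the two groups. The normalization choice, the `has_summary` field, and both function names are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_normalized(rewards, eps=1e-6):
    """Center and scale rewards within one group (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def branchwise_advantages(rollouts):
    """Compute advantages separately for summary and non-summary branches.

    rollouts: list of dicts with 'reward' (float) and 'has_summary' (bool).
    Returns a list of advantages in the same order as the input.
    """
    advantages = [0.0] * len(rollouts)
    for flag in (True, False):
        idx = [i for i, r in enumerate(rollouts) if r["has_summary"] == flag]
        if not idx:
            continue
        normed = group_normalized([rollouts[i]["reward"] for i in idx])
        for i, a in zip(idx, normed):
            advantages[i] = a
    return advantages
```

Normalizing within each group means a rollout is compared only against rollouts of the same kind, which is what the caption's wording suggests.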

If this is right

  • Models gain an internal way to manage context length instead of depending on external truncation rules.
  • Early mistakes become less likely to derail the entire trajectory once a corrective summary can be inserted.
  • Training simultaneously optimizes for answer correctness and for brevity in the reasoning path.
  • The same contrastive idea can be applied on top of existing RLVR pipelines without changing the reward model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contrastive branch construction might be reusable for other meta-actions such as explicit backtracking or verification steps.
  • If the learned policy generalizes, downstream applications could operate with smaller context windows while retaining reasoning quality.
  • The method could be tested on tasks where the benefit of summarization is less obvious, such as short arithmetic chains.

Load-bearing premise

The entropy reduction and error-mitigation effects seen in pilot studies still appear when the same contrastive mechanism runs inside the complete RLVR loop on the main benchmarks.

What would settle it

Train an otherwise identical RLVR agent without the contrastive summarization branches and measure whether average rollout length and task accuracy remain unchanged on the evaluation suites.
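The comparison proposed above reduces to two numbers per run. The sketch below shows the arithmetic under assumed inputs: each run is a list of hypothetical `(is_correct, num_tokens)` pairs, and the names `summarize_run` and `compare_runs` are illustrative only.

```python
from statistics import mean


def summarize_run(results):
    """results: list of (is_correct: bool, num_tokens: int) per evaluation item."""
    accuracy = mean(1.0 if correct else 0.0 for correct, _ in results)
    avg_length = mean(tokens for _, tokens in results)
    return accuracy, avg_length


def compare_runs(full, ablated):
    """Compare the full method against the ablated baseline."""
    acc_full, len_full = summarize_run(full)
    acc_abl, len_abl = summarize_run(ablated)
    return {
        # Absolute accuracy difference (full minus ablated).
        "accuracy_gain": acc_full - acc_abl,
        # Fractional length reduction relative to the ablated run.
        "length_reduction": (len_abl - len_full) / len_abl,
    }
```

If both values stay near zero across evaluation suites, the contrastive branches would not be the source of the reported gains.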

Figures

Figures reproduced from arXiv: 2606.13316 by Hailang Huang, Pengkun Wang, Renda Li, Shidong Yang, Xiangxiang Chu, Xucong Wang, Yong Wang, Ziyu Ma.

Figure 1
Figure 1: Pilot studies. (a) Analysis of token entropy distributions near the summarization phrases. (b) Average accuracy of continuations of wrong rollouts truncated from different positions.
Figure 2
Figure 2: The proposed ReSum. (left) ReSum incentivizes the summarization behavior of the LLM during reasoning. (middle & right) The LLM generates initial rollouts for each query, then adopts rollout branching at both Artifact Points (APs) and Natural Points (NPs). The advantage is calculated within branches with and without the summarization, respectively.
Figure 3
Figure 3: The training and evaluation dynamics of the task reward. We compare ReSum and DGPO …
Figure 4
Figure 4: Training and evaluation dynamics of output length with Qwen2.5-Math-7B.
Figure 5
Figure 5: Analysis of accuracy across different difficulty levels of MATH500.
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces ReSum, an RLVR framework that enables LLMs to perform self-summarization to compress and organize reasoning trajectories. Motivated by pilot studies showing that self-summarization lowers token-level entropy and that a summarization phrase mitigates error propagation from incorrect prefixes, it proposes a summarization-aware adaptive rollout mechanism that creates contrastive branches (masking the phrase on spontaneous triggers, randomly injecting it otherwise) plus a summarization-aware advantage. The central empirical claim is an average 4% performance improvement and 18.6% rollout length reduction.

Significance. If the results hold and the gains can be attributed to the claimed mechanism, the work would be significant for RLVR research: it offers an internal, model-driven approach to managing long reasoning rollouts instead of external organizers, with potential benefits for coherence and context efficiency in long-horizon tasks.

major comments (1)
  1. [Abstract / pilot studies] Abstract / pilot studies description: the 4% gain and 18.6% length reduction are attributed to the contrastive rollout mechanism, which is justified by the pilot observations on entropy reduction and error mitigation. No analysis or ablation is described showing that these effects persist once the policy is updated inside the RLVR loop (different distribution, different optimization pressure), so the contrastive branches may not isolate the intended benefit and the reported improvements cannot be confidently attributed to the adaptive mechanism.
minor comments (1)
  1. [Abstract] Abstract: no information is given on baselines, number of runs, statistical tests, datasets, or tasks, which weakens the ability to evaluate the central empirical claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript to strengthen the attribution of results to the proposed mechanism.

Point-by-point responses
  1. Referee: [Abstract / pilot studies] Abstract / pilot studies description: the 4% gain and 18.6% length reduction are attributed to the contrastive rollout mechanism, which is justified by the pilot observations on entropy reduction and error mitigation. No analysis or ablation is described showing that these effects persist once the policy is updated inside the RLVR loop (different distribution, different optimization pressure), so the contrastive branches may not isolate the intended benefit and the reported improvements cannot be confidently attributed to the adaptive mechanism.

    Authors: We agree that the pilot studies alone do not demonstrate persistence of the entropy and error-mitigation effects under the shifted distribution induced by RLVR updates. The pilots were intended only to motivate the design of the summarization-aware adaptive rollout and the contrastive branching procedure. The main experimental results reflect performance of the complete ReSum framework, in which the contrastive branches and summarization-aware advantage are active at every training step. To directly address the concern about isolation of the mechanism inside the training loop, we will add a new subsection containing (i) token-level entropy measurements on on-policy rollouts collected at multiple training checkpoints and (ii) an ablation that disables the contrastive masking/injection while keeping all other components fixed. These additions will be included in the revised manuscript. revision: yes
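The token-level entropy measurement in item (i) can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: it takes next-token probability distributions as plain lists, uses Shannon entropy in nats, and compares a fixed window before and after a summarization position. The function names and the `window` parameter are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average entropy over a non-empty list of distributions."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_shift(distributions, summary_index, window=2):
    """Mean entropy after the summary position minus mean entropy before it.

    distributions: one probability list per generated token.
    summary_index: token position where the summarization phrase occurs.
    A negative result means entropy dropped after the summary.
    """
    before = distributions[max(0, summary_index - window):summary_index]
    after = distributions[summary_index + 1:summary_index + 1 + window]
    return mean_entropy(after) - mean_entropy(before)
```

Running the same measurement on on-policy rollouts from several checkpoints would show whether the drop observed in the pilot study persists as the policy changes.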

Circularity Check

0 steps flagged

No circularity: empirical outcomes independent of inputs

Full rationale

The paper motivates its adaptive rollout mechanism from separate pilot observations on entropy and error propagation, then reports measured performance gains (4% average improvement, 18.6% length reduction) from full RLVR experiments on the resulting system. No equations, fitted parameters, or self-citations reduce the reported results to quantities defined by construction from the same inputs. The derivation chain consists of an empirical training procedure whose outputs are not tautological with its design choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, equations, and experimental details unavailable. The framework rests on standard RLVR assumptions plus the new self-summarization behavior.

axioms (1)
  • domain assumption: Existing RLVR methods often encourage unnecessarily long reasoning rollouts that degrade coherence.
    Source: the opening limitation stated in the abstract.

pith-pipeline@v0.9.1-grok · 5805 in / 1140 out tokens · 27545 ms · 2026-06-27T06:38:12.580085+00:00 · methodology

