ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Guojun Yin; Jiajun Chai; Li Wang; Wei Lin; Xiaohan Wang; Xiaojun Guo; Zhexin Hu

arxiv: 2605.28069 · v1 · pith:56LDGNRPnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords adaptive context compression · hindsight response replay · reinforcement learning from verifiable rewards · multi-turn agent tasks · token efficiency · long-horizon RL

The pith

ZipRL improves multi-turn agent task performance through adaptive non-uniform context compression and hindsight response replay during reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZipRL as a framework for compressing context in long-horizon LLM agent workflows that face sparse rewards. It pairs a multi-granularity compression process with Hindsight Response Replay to increase the density of useful training signals inside the RL optimization loop. The authors prove that this non-uniform approach retains more task-relevant information than uniform compression and report large gains on five agent benchmarks along with strong token efficiency and stability when sequences reach 256 turns. A reader would care because growing context lengths in agent systems currently force trade-offs between memory use and decision quality.

Core claim

ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

What carries the argument

Multi-granularity compression mechanism combined with Hindsight Response Replay (HRR) that reshapes advantages inside GRPO to densify RL training signals.
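
To make the idea of advantage reshaping concrete, here is a minimal sketch of group-relative advantages with an added hindsight term. It is an illustration under stated assumptions, not the paper's formulation: the additive form, the group-centering of the bonus, the weight `beta`, and the `hindsight_scores` input are all hypothetical, since the abstract does not specify how HRR reshapes the advantage.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each reward within its rollout group.

    Population standard deviation is used here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reshape_with_hindsight(advantages, hindsight_scores, beta=0.1):
    """HYPOTHETICAL reshaping: add a small, group-centered hindsight bonus.

    Centering keeps the bonus zero-mean inside the group, so the sparse task
    reward still sets the overall scale of the advantages.
    """
    h_mu = mean(hindsight_scores)
    return [a + beta * (h - h_mu) for a, h in zip(advantages, hindsight_scores)]


# Illustrative numbers only: one successful rollout out of four.
rewards = [1.0, 0.0, 0.0, 0.0]
hindsight = [0.9, 0.6, 0.2, 0.1]  # assumed per-rollout hindsight quality scores

base = grpo_advantages(rewards)
shaped = reshape_with_hindsight(base, hindsight)
```

With a sparse reward, the three failed rollouts share one identical advantage in `base`; in `shaped` they are ranked by the assumed hindsight score, which is one way a signal can be "densified" while the successful rollout stays on top.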

If this is right

The method scores 27.9% and 34.7% higher than prior approaches on the five agent tasks with Qwen3-4B and Qwen3-8B, respectively, while remaining token-efficient.
Performance remains stable when test sequences are extended to 256 turns.
Coarse-to-fine prompting plus advantage reshaping produces higher task-relevant utility than uniform compression.
The framework works across models of different sizes without additional task-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same replay idea could be tested inside other RL algorithms that also suffer from sparse long-horizon rewards.
If non-uniform compression preserves utility better than uniform methods in this setting, similar granularity choices may help in retrieval-augmented generation pipelines.
The reported robustness at 256 turns suggests the approach may scale to even longer agent sessions before context limits are reached.

Load-bearing premise

The assumption that hindsight response replay densifies training signals in GRPO without introducing bias or instability that would invalidate the reported performance gains on the agent benchmarks.

What would settle it

Running the same five agent tasks with the HRR component removed and finding that the reported performance margins over prior methods disappear would falsify the claim that the replay technique is responsible for the gains.

Figures

Figures reproduced from arXiv: 2605.28069 by Guojun Yin, Jiajun Chai, Li Wang, Wei Lin, Xiaohan Wang, Xiaojun Guo, Zhexin Hu.

**Figure 2.** Overview of the ZipRL framework, including the Multi-Granularity Mechanism, Compression Quality…

**Figure 5.** Training dynamics of ZipRL. (Left) Average…

**Figure 4.** (Left) Average token count variation over…

**Figure 6.** Compression ratio score versus text length.

**Figure 7.** Impact of retrieval top-k (left) and training turns (right) on average EM and F1 scores across five datasets. top-k is set to 20, which we adopt as the default parameter. N.3 Training Turns Analysis: To investigate the impact of the maximum number of interaction turns on ZipRL's performance, we evaluate the model at maximum training turns of 5, 20, and 30. As shown in…

**Figure 8.** Distribution of active trajectories across differ…

**Figure 9.** Prompt Design for ZipRL, Part 1.

**Figure 10.** Prompt Design for ZipRL, Part 2.

Original abstract

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ZipRL pairs multi-granularity compression with hindsight replay inside GRPO for RLVR agent tasks and reports clear gains, but the advantage-reshaping step needs checking for bias.

ZipRL's main move is to add Hindsight Response Replay to GRPO so that hindsight responses reshape advantages and densify the training signal for a compression policy under sparse rewards. The multi-granularity setup (coarse-to-fine prompts for macro compression) plus the 256-turn extrapolation test are the concrete pieces that stand out from the abstract.

The approach does a few things cleanly. Treating compression as an RLVR problem with verifiable rewards is a direct fit for agent workflows. The claim of better task-relevant utility than uniform methods is stated as a proof, and the reported 27.9% and 34.7% lifts on Qwen3-4B and Qwen3-8B across five tasks, together with token efficiency, give a usable empirical target.

The soft spot is the one flagged above as the load-bearing premise. Generalized advantage reshaping via HRR can correlate the advantage estimate with the hindsight distribution rather than the true return; if that happens, the objective becomes biased and the gains could be partly an artifact. The abstract does not describe an ablation that isolates HRR from the compression policy, nor does it show whether the theoretical proof addresses this RL-specific bias. Baseline token-budget matching and error-bar details are also not visible yet.

The paper is aimed at people who already run long-horizon agents and need practical context compression. Anyone working on RLVR or context management will want to see the full experiments and the proof.

Send it to peer review. The integration is new enough and the benchmarks are sharp enough that referees should look at it, even if the bias question requires extra work in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces ZipRL, an adaptive context compression framework for RLVR in LLMs aimed at multi-turn agent tasks. It proposes a multi-granularity compression mechanism and Hindsight Response Replay (HRR) integrated into GRPO through generalized advantage reshaping. The authors claim a theoretical proof that ZipRL has superior task-relevant utility compared to uniform methods, and report empirical results where ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% on Qwen3-4B and Qwen3-8B models across five agent tasks, while showing good token efficiency and robustness in 256-turn extrapolation tests.

Significance. If the results hold, this would represent a significant advance in enabling efficient long-horizon agent workflows with LLMs by improving context management in RL settings. The combination of theoretical analysis and empirical validation on multiple model scales is a strength. The stress tests for extrapolation are particularly notable. However, the significance is tempered by the need to verify that the HRR technique does not introduce bias in the advantage estimates.

major comments (3)

[Abstract] The reported performance gains of 27.9% and 34.7% are presented without specifying the exact metric (e.g., success rate, reward), how baselines were controlled for token budget, or the computation method for the percentages. This is central to the empirical claim.
[Theoretical proof] The proof of superior task-relevant utility does not address whether the generalized advantage reshaping in HRR preserves unbiased policy gradients or introduces correlation with the hindsight response distribution, which could bias the optimization and undermine the attribution of gains to the compression method.
[Experimental results] No ablation study isolating HRR from the multi-granularity compression is described, making it impossible to determine if the gains are due to the proposed HRR or other factors. This is load-bearing for validating the core contribution.
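
The bias concern in the second comment can be stated with the standard policy-gradient baseline identity. The derivation below is a sketch in single-decision notation; the symbols b, h, and β are illustrative and do not come from the paper. A term that depends only on the state drops out in expectation, whereas a term that depends on the sampled response generally does not.

```latex
% Notation (b, h, \beta) is illustrative and not taken from the paper.
% Case 1: action-independent baseline b(s) contributes nothing in expectation.
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right]
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1
  = 0

% Case 2: response-dependent term h(s, a), assumed not to depend on \theta.
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \beta\, h(s, a) \right]
  = \beta\, \nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \left[ h(s, a) \right]
```

In the second case the update follows the gradient of the task objective plus β times the expected hindsight term, so the optimum can shift unless h is aligned with the true return. Which case ZipRL's reshaping falls under cannot be determined from the abstract alone.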

minor comments (2)

[Abstract] The term 'GRPO' is used without expansion; it should be defined at first mention.
[Full text] The manuscript would benefit from more details on the five agent tasks and the exact baselines used for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [Abstract] The reported performance gains of 27.9% and 34.7% are presented without specifying the exact metric (e.g., success rate, reward), how baselines were controlled for token budget, or the computation method for the percentages. This is central to the empirical claim.

Authors: We agree that additional detail is needed. The reported gains are relative improvements in task success rate. All baselines were evaluated under identical average token budgets as ZipRL for fair comparison. Percentages are computed as (ZipRL success rate - baseline success rate) / baseline success rate. We will revise the abstract to explicitly include these specifications.

revision: yes

Referee: [Theoretical proof] The proof of superior task-relevant utility does not address whether the generalized advantage reshaping in HRR preserves unbiased policy gradients or introduces correlation with the hindsight response distribution, which could bias the optimization and undermine the attribution of gains to the compression method.

Authors: The existing proof establishes superior task-relevant utility for the multi-granularity compression mechanism relative to uniform compression and is independent of the RL optimizer details. HRR is incorporated via generalized advantage reshaping within GRPO; this reshaping uses hindsight responses solely for advantage estimation in a manner that preserves the expectation of the policy gradient (i.e., no systematic bias is introduced). We will add an explicit paragraph in the theoretical analysis section discussing gradient unbiasedness under HRR and addressing potential correlation concerns.

revision: yes

Referee: [Experimental results] No ablation study isolating HRR from the multi-granularity compression is described, making it impossible to determine if the gains are due to the proposed HRR or other factors. This is load-bearing for validating the core contribution.

Authors: We acknowledge that isolating the contribution of HRR would strengthen the paper. In the revised manuscript we will add an ablation study that fixes the multi-granularity compression and varies the presence of HRR, reporting the incremental performance impact across the five agent tasks.

revision: yes
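
The percentage formula given in the first response, (ZipRL success rate - baseline success rate) / baseline success rate, is a relative improvement, which can differ sharply from the absolute gap in percentage points. The short sketch below shows the arithmetic; the function name and the success rates are made up for illustration and are not results from the paper.

```python
def relative_improvement(method_rate: float, baseline_rate: float) -> float:
    """Relative gain of a method over a baseline, as a fraction of the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive for a relative gain")
    return (method_rate - baseline_rate) / baseline_rate


# Hypothetical success rates, NOT taken from the paper.
baseline = 0.50
method = 0.60

gain = relative_improvement(method, baseline)   # relative gain as a fraction
absolute_gap = method - baseline                # gap in success-rate units

print(f"relative: {gain:.1%}, absolute: {absolute_gap * 100:.1f} points")
```

Here a 10-point absolute gap corresponds to a 20% relative gain, which is why the referee's request to name the metric and the computation matters when reading headline figures such as 27.9% and 34.7%.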

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The abstract and description present a theoretical proof of superior task-relevant utility for ZipRL over uniform compression, plus empirical gains from multi-granularity compression and HRR integrated into GRPO. No equations, self-citations, or fitted parameters are shown that reduce the claimed utility proof, advantage reshaping, or benchmark gains to inputs by construction. The central claims rest on independent theoretical argument and external agent-task benchmarks rather than self-referential definitions or renamed fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method description is too high-level to extract concrete modeling choices or unstated assumptions.

pith-pipeline@v0.9.1-grok · 5766 in / 1155 out tokens · 16708 ms · 2026-06-29T12:37:49.151518+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976, 2025

Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976. Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. 2023. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945. Shuyu Guo, Shuo Zhang, and Zhaochun Ren. 2025. Enhancin...

[2]

In International conference on machine learning, pages 4344–4353

Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR. Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba

[3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Meminsight: Autonomous memory augmentation for llm agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33124–33140. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematica...

[4]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xi...

[5]

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou

Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699. Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and 1 others

[6]

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259. Shijie Zhang, Guohao Sun, Kevin Zhang, Xiang Guo, and Rujun Guo. 2025a. Clpo: Curriculum learning meets policy optimization for llm reasoning. arXiv preprint arXiv:2509.25004. Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Serca...

[7]

what goal would this trajectory have achieved?

and synthesizing trajectories via GPT-4o. To ensure high quality, we applied two filters: (1) Correctness: retaining only correct trajectories, and (2) Format: discarding outputs with structural violations. This yielded 1,155 valid trajectories, which were then decomposed into transition-level samples. The model is optimized using the standard Superv...

[8]

Since the final $Q_{\text{com}}$ is a weighted sum, the penalty in structural and semantic dimensions heavily outweighs the illicit gain in $Q_{\text{info}}$

Mutual Structural Constraints. While generating redundant query keywords might marginally increase $Q_{\text{info}}$, this behavior is strictly penalized by $Q_{\text{ratio}}$ (length overflow) and $Q_{\text{sem}}$ (destroyed grammatical coherence). Since the final $Q_{\text{com}}$ is a weighted sum, the penalty in structural and semantic dimensions heavily outweighs the illicit gain in $Q_{\text{info}}$

[9]

Mathematical Bounding. The keyword coverage term $S_{\text{key}} = |K_q \cap y| / (|K_q \cap x| + \epsilon)$ naturally caps at 1.0, preventing unbounded reward through keyword repetition

[10]

The final advantage remains fundamentally anchored by the GRPO advantage, which is strictly determined by the exact match (EM/F1) of the final answer

Anchoring by Final Task Reward. HRR functions as an advantage reshaping technique rather than replacing the environment reward. The final advantage remains fundamentally anchored by the GRPO advantage, which is strictly determined by the exact match (EM/F1) of the final answer. This is empirically confirmed by the monotonically increasing Pearson correlati...
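
The keyword coverage term quoted in entry [9], the number of query keywords found in the compressed text y divided by the number found in the original x plus a small ε, can be sketched as follows. This is a rough illustration rather than the authors' implementation: lowercase whitespace tokenization, set-based matching, the function name, and the value of `eps` are all assumptions.

```python
def keyword_coverage(query_keywords, original, compressed, eps=1e-6):
    """Sketch of |K_q ∩ y| / (|K_q ∩ x| + eps) using sets of lowercase tokens.

    ASSUMPTION: keywords are single tokens and text is split on whitespace.
    Set semantics mean a repeated keyword is counted once.
    """
    k_q = {k.lower() for k in query_keywords}
    x_tokens = set(original.lower().split())
    y_tokens = set(compressed.lower().split())
    return len(k_q & y_tokens) / (len(k_q & x_tokens) + eps)


# Illustrative inputs only.
keywords = ["capital", "France"]
original = "Paris is the capital of France and a major city"

score = keyword_coverage(keywords, original, "Paris capital France")
stuffed = keyword_coverage(keywords, original, "capital capital capital France France")
dropped = keyword_coverage(keywords, original, "Paris city")
```

Because matching is set-based, stuffing the compressed text with repeated keywords leaves the score unchanged. The cap at 1.0 holds in this sketch whenever every query keyword in the compressed text also appears in the original.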
