Efficient RL Training for LLMs with Experience Replay
Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3
The pith
A well-designed replay buffer can drastically cut inference compute in LLM RL post-training without degrading performance or policy entropy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Strict on-policy sampling becomes suboptimal once generation cost is taken into account. A replay buffer that trades off staleness-induced variance, sample diversity, and generation expense allows multiple uses of each rollout. This yields large reductions in inference compute while final model performance stays the same or improves and policy entropy is preserved, directly contradicting the prevailing belief that only fresh data works for LLM post-training.
What carries the argument
The experience replay buffer, which stores previous rollouts and re-samples them during training under a design that explicitly manages staleness variance, diversity, and generation cost.
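A minimal sketch of such a buffer, assuming a FIFO store with a hard staleness cutoff. The names `Rollout`, `ReplayBuffer`, `capacity`, and `max_staleness` are illustrative and do not come from the paper, whose actual buffer design may differ.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_version: int  # training step of the policy that generated it


class ReplayBuffer:
    """Hypothetical FIFO buffer that only serves sufficiently fresh rollouts."""

    def __init__(self, capacity: int, max_staleness: int, seed: int = 0):
        # deque with maxlen evicts the oldest rollout once capacity is reached
        self.items = deque(maxlen=capacity)
        self.max_staleness = max_staleness
        self.rng = random.Random(seed)

    def add(self, rollout: Rollout) -> None:
        self.items.append(rollout)

    def sample(self, batch_size: int, current_version: int) -> list:
        # Keep only rollouts whose generating policy is recent enough.
        fresh = [
            r for r in self.items
            if current_version - r.policy_version <= self.max_staleness
        ]
        # Sample without replacement so a batch contains no duplicates.
        k = min(batch_size, len(fresh))
        return self.rng.sample(fresh, k)
```

In this sketch, `capacity` bounds how many distinct rollouts are available (diversity), `max_staleness` bounds how old a reused rollout may be (staleness), and the number of times each rollout is sampled before eviction determines how much generation is saved.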
If this is right
- The amount of new model generations required per training step can be reduced substantially.
- Final performance on downstream tasks can remain unchanged or increase.
- Policy entropy stays high, avoiding premature collapse.
- Strict on-policy sampling is no longer required when generation is expensive.
- Training runs can continue for more steps under a fixed inference budget.
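The budget arithmetic behind the bullets above can be illustrated with a back-of-the-envelope calculation. This is a sketch under simplifying assumptions that are not taken from the paper: every rollout has the same length, every rollout is reused exactly `reuse` times, and training compute is ignored. All function and parameter names are hypothetical.

```python
def training_steps(token_budget: int, tokens_per_rollout: int,
                   rollouts_per_step: int, reuse: int) -> int:
    """Gradient steps affordable when each rollout is used `reuse` times."""
    if reuse < 1:
        raise ValueError("reuse must be at least 1")
    # Number of rollouts the inference budget can pay for.
    rollouts = token_budget // tokens_per_rollout
    # Each rollout contributes `reuse` training samples.
    return (rollouts * reuse) // rollouts_per_step


on_policy_steps = training_steps(1_000_000, 1_000, 50, reuse=1)
replay_steps = training_steps(1_000_000, 1_000, 50, reuse=4)
```

With a budget of 1,000,000 generated tokens, 1,000 tokens per rollout, and 50 rollouts per step, strict on-policy training (`reuse=1`) affords 20 steps, while reusing each rollout four times affords 80 steps under the same inference budget.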
Where Pith is reading between the lines
- Similar replay strategies might apply to any RL setting where sampling from the current policy is the dominant cost, such as large-scale simulation or expensive API calls.
- The same trade-off framing could guide hybrid data pipelines that mix on-policy and replay data in a compute-aware way.
- If the buffer design generalizes, it would make longer-horizon RL training feasible on current hardware without proportional increases in generation spend.
Load-bearing premise
It is possible to design a replay buffer that optimally balances staleness variance, sample diversity, and generation cost so that reusing data matches or beats strict on-policy sampling.
What would settle it
An experiment in which the replay buffer either produces measurably worse final task performance or causes a clear drop in policy entropy relative to the pure on-policy baseline would falsify the central claim.
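One way to track the entropy side of this test is to log the mean per-token entropy of the policy's next-token distribution during both runs. The sketch below assumes that definition of policy entropy; the review does not state which entropy metric the paper reports, and the function names are hypothetical.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(logits_per_token):
    """Average entropy over all token positions in a batch of rollouts."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)
```

A uniform distribution over four tokens gives the maximum value, log 4, while a sharply peaked distribution gives a value near zero. A replay run whose curve falls well below the on-policy curve would count as evidence against the claim.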
Original abstract
While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that experience replay buffers, when designed to trade off staleness-induced variance, sample diversity, and generation cost, make strict on-policy sampling suboptimal for LLM post-training. It argues that a well-designed replay buffer can drastically reduce inference compute while maintaining or improving final model performance and preserving policy entropy, supported by a systematic study and empirical results.
Significance. If the empirical claims hold after controlling for total generation budget and sample reuse, the work would be significant for efficient RL in LLMs, as it provides a concrete alternative to the dominant on-policy paradigm and formalizes a practical trade-off that could lower training costs without performance loss.
major comments (2)
- [Experiments] The central empirical claim that replay makes on-policy suboptimal requires experiments that match total tokens generated and number of gradient updates across replay and on-policy conditions. Without such controls, observed gains may reflect increased effective batch size or more updates per rollout rather than the claimed staleness-diversity-cost trade-off.
- [Method] The formalization of the optimal replay design as a three-way trade-off is stated in the abstract and introduction but lacks an explicit objective or derivation showing how buffer size, replay ratio, and sampling policy are chosen to minimize the combined cost; this makes the 'well-designed' qualifier hard to reproduce or falsify.
minor comments (2)
- [Method] Notation for replay ratio, buffer capacity, and staleness metric should be defined once in a dedicated subsection rather than introduced piecemeal.
- [Results] The abstract states that policy entropy is preserved, but the corresponding figure or table reporting entropy trajectories is not referenced in the text.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that strengthening the experimental controls and providing a more explicit formalization will improve the manuscript. Below we respond point-by-point and describe the revisions we will make.
- Referee: [Experiments] The central empirical claim that replay makes on-policy suboptimal requires experiments that match total tokens generated and number of gradient updates across replay and on-policy conditions. Without such controls, observed gains may reflect increased effective batch size or more updates per rollout rather than the claimed staleness-diversity-cost trade-off.
Authors: We agree that a fair demonstration of suboptimality requires matched total generation budget and gradient updates. Our current experiments emphasize the practical regime in which generation is the dominant cost, showing that replay achieves comparable or better performance with substantially fewer tokens generated. To isolate the staleness-diversity-cost trade-off, we will add a new set of controlled experiments in the revision. These will fix the total number of tokens generated and the number of gradient steps across replay and on-policy runs by varying rollout frequency and effective batch size, with results reported in a new table and figure.
Revision: yes
- Referee: [Method] The formalization of the optimal replay design as a three-way trade-off is stated in the abstract and introduction but lacks an explicit objective or derivation showing how buffer size, replay ratio, and sampling policy are chosen to minimize the combined cost; this makes the 'well-designed' qualifier hard to reproduce or falsify.
Authors: We acknowledge that the three-way trade-off is currently described at a conceptual level. While the design choices were informed by systematic empirical sweeps and prior RL literature on experience replay, we agree that an explicit objective would improve reproducibility. In the revision we will add a dedicated subsection that defines a combined cost function C = α·Var(staleness) + β·(1/Diversity) + γ·Generation_cost and describes the empirical procedure (grid search over buffer size and replay ratio, with sampling policy selected to maximize diversity under the cost constraint) used to identify the operating points reported in the paper.
Revision: yes
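The grid search described in the rebuttal can be sketched as follows. Only the form of the combined cost C comes from the text above. The proxy formulas in `toy_terms` are invented placeholders so the example runs; they are not the paper's measurements, and in practice each term would be estimated from training runs. Selecting the sampling policy is left out of this sketch.

```python
from itertools import product


def combined_cost(staleness_var, diversity, generation_cost,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """C = alpha*Var(staleness) + beta*(1/Diversity) + gamma*Generation_cost."""
    return alpha * staleness_var + beta * (1.0 / diversity) + gamma * generation_cost


def toy_terms(buffer_size, replay_ratio):
    # Hypothetical proxies, chosen only for illustration:
    staleness_var = replay_ratio * buffer_size / 100.0  # more reuse, staler data
    diversity = float(buffer_size)                      # bigger buffer, more variety
    generation_cost = 1.0 / replay_ratio                # more reuse, less generation
    return staleness_var, diversity, generation_cost


def grid_search(buffer_sizes, replay_ratios, **weights):
    """Return the (buffer_size, replay_ratio) pair with the lowest cost."""
    return min(
        product(buffer_sizes, replay_ratios),
        key=lambda cfg: combined_cost(*toy_terms(*cfg), **weights),
    )


best = grid_search([10, 100], [1, 2, 4])
```

Under these toy proxies and unit weights, the search selects a buffer size of 10 with a replay ratio of 4. The weights α, β, and γ are free parameters, so the chosen operating point depends on how they are set.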
Circularity Check
No circularity; empirical claims rest on experiments, not self-referential derivations
The paper conducts a systematic empirical study of replay buffers for LLM post-training, formalizing optimal design as a trade-off between staleness-induced variance, sample diversity, and generation cost, then showing via experiments that well-designed replay reduces inference compute without degrading performance. No equations, derivations, or first-principles results are presented that reduce any prediction to fitted parameters, self-citations, or inputs by construction. Central claims are supported by experimental observation rather than mathematical self-reference, and the work is self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fresh on-policy data is essential for high performance in LLM post-training
Forward citations
Cited by 1 Pith paper
- When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.