Efficient RL Training for LLMs with Experience Replay

Charles Arnal; Julia Kempe; Remi Munos; Taco Cohen; Vivien Cabannes

arxiv: 2604.08706 · v1 · submitted 2026-04-09 · 💻 cs.LG

Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3

keywords: experience replay · reinforcement learning · LLM post-training · inference efficiency · on-policy sampling · policy entropy · replay buffer design

The pith

A well-designed replay buffer can drastically cut inference compute in LLM RL post-training without degrading performance or policy entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the common assumption that LLM reinforcement learning requires fresh on-policy samples at every step to reach high performance. It formalizes replay buffer design as a balance between staleness that adds variance, the benefits of sample diversity, and the high cost of generating new model outputs. Experiments show that reusing stored rollouts can reduce the number of expensive generations needed while achieving equal or better final results and keeping entropy stable. A sympathetic reader would care because generation dominates compute budgets in scaling RL for language models, so cheaper data reuse could enable longer or larger-scale training runs.

Core claim

Strict on-policy sampling becomes suboptimal once generation cost is taken into account. A replay buffer that trades off staleness-induced variance, sample diversity, and generation expense allows multiple uses of each rollout. This yields large reductions in inference compute while final model performance stays the same or improves and policy entropy is preserved, directly contradicting the prevailing belief that only fresh data works for LLM post-training.

What carries the argument

The experience replay buffer, which stores previous rollouts and re-samples them during training under a design that explicitly manages staleness variance, diversity, and generation cost.
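As a rough illustration of the mechanism described above, the following is a minimal sketch of a buffer that stores rollouts and re-samples them with a hard cap on staleness. It is not the paper's design: the `Rollout` fields, the FIFO eviction rule, the `max_staleness` cutoff, and uniform sampling are all assumptions chosen for brevity.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_step: int  # training step of the policy that generated this rollout


class ReplayBuffer:
    """FIFO buffer of rollouts with a staleness cutoff (hypothetical design)."""

    def __init__(self, capacity: int, max_staleness: int, seed: int = 0):
        self.items = deque(maxlen=capacity)  # oldest rollouts are evicted first
        self.max_staleness = max_staleness
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def sample(self, batch_size: int, current_step: int):
        # Keep only rollouts generated at most max_staleness steps ago.
        eligible = [r for r in self.items
                    if current_step - r.policy_step <= self.max_staleness]
        k = min(batch_size, len(eligible))
        # Uniform sampling without replacement; other schemes are possible.
        return self.rng.sample(eligible, k)


# Usage: six rollouts from steps 0..5 go into a buffer of capacity 4,
# so only steps 2..5 are retained; the cutoff then drops step 2.
buffer = ReplayBuffer(capacity=4, max_staleness=2)
buffer.add([Rollout("p", f"c{i}", 1.0, policy_step=i) for i in range(6)])
batch = buffer.sample(batch_size=8, current_step=5)
```

In this sketch, capacity bounds how much diversity the buffer can hold, while `max_staleness` bounds how far the sampled data may drift from the current policy. Those are two of the three quantities the paper says a buffer design has to balance.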

If this is right

The amount of new model generations required per training step can be reduced substantially.
Final performance on downstream tasks can remain unchanged or increase.
Policy entropy stays high, avoiding premature collapse.
Strict on-policy sampling is no longer required once generation cost is high.
Training runs can continue for more steps under a fixed inference budget.
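The budget argument in the list above can be made concrete with some simple arithmetic. The sketch below assumes every training step consumes a fixed batch of rollouts and that a fraction `replay_fraction` of each batch comes from the buffer instead of fresh generation. The function name, the parameter, and the numbers are illustrative and do not come from the paper.

```python
def steps_under_budget(generation_budget: int, batch_size: int,
                       replay_fraction: float) -> int:
    """Training steps affordable when only part of each batch is freshly generated."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Rollouts that must be generated for each step (the rest are replayed).
    fresh_per_step = batch_size - int(batch_size * replay_fraction)
    return generation_budget // fresh_per_step


# Same generation budget, with and without replay (illustrative numbers).
on_policy_steps = steps_under_budget(100_000, 256, replay_fraction=0.0)
replay_steps = steps_under_budget(100_000, 256, replay_fraction=0.75)
```

Under these assumptions, replaying three quarters of each batch stretches the same generation budget over roughly four times as many steps. Whether those extra steps are as useful as on-policy ones is the empirical question the paper addresses.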

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar replay strategies might apply to any RL setting where sampling from the current policy is the dominant cost, such as large-scale simulation or expensive API calls.
The same trade-off framing could guide hybrid data pipelines that mix on-policy and replay data in a compute-aware way.
If the buffer design generalizes, it would make longer-horizon RL training feasible on current hardware without proportional increases in generation spend.

Load-bearing premise

It is possible to design a replay buffer that optimally balances staleness variance, sample diversity, and generation cost so that reusing data matches or beats strict on-policy sampling.

What would settle it

An experiment in which the replay buffer either produces measurably worse final task performance or causes a clear drop in policy entropy relative to the pure on-policy baseline would falsify the central claim.

Original abstract

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that experience replay buffers, when designed to trade off staleness-induced variance, sample diversity, and generation cost, make strict on-policy sampling suboptimal for LLM post-training. It argues that a well-designed replay buffer can drastically reduce inference compute while maintaining or improving final model performance and preserving policy entropy, supported by a systematic study and empirical results.

Significance. If the empirical claims hold after controlling for total generation budget and sample reuse, the work would be significant for efficient RL in LLMs, as it provides a concrete alternative to the dominant on-policy paradigm and formalizes a practical trade-off that could lower training costs without performance loss.

major comments (2)

[Experiments] The central empirical claim that replay makes on-policy suboptimal requires experiments that match total tokens generated and number of gradient updates across replay and on-policy conditions. Without such controls, observed gains may reflect increased effective batch size or more updates per rollout rather than the claimed staleness-diversity-cost trade-off.
[Method] The formalization of the optimal replay design as a three-way trade-off is stated in the abstract and introduction but lacks an explicit objective or derivation showing how buffer size, replay ratio, and sampling policy are chosen to minimize the combined cost; this makes the 'well-designed' qualifier hard to reproduce or falsify.

minor comments (2)

[Method] Notation for replay ratio, buffer capacity, and staleness metric should be defined once in a dedicated subsection rather than introduced piecemeal.
[Results] The abstract states that policy entropy is preserved, but the corresponding figure or table reporting entropy trajectories is not referenced in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that strengthening the experimental controls and providing a more explicit formalization will improve the manuscript. Below we respond point-by-point and describe the revisions we will make.

Referee: [Experiments] The central empirical claim that replay makes on-policy suboptimal requires experiments that match total tokens generated and number of gradient updates across replay and on-policy conditions. Without such controls, observed gains may reflect increased effective batch size or more updates per rollout rather than the claimed staleness-diversity-cost trade-off.

Authors: We agree that a fair demonstration of suboptimality requires matched total generation budget and gradient updates. Our current experiments emphasize the practical regime in which generation is the dominant cost, showing that replay achieves comparable or better performance with substantially fewer tokens generated. To isolate the staleness-diversity-cost trade-off, we will add a new set of controlled experiments in the revision. These will fix the total number of tokens generated and the number of gradient steps across replay and on-policy runs by varying rollout frequency and effective batch size, with results reported in a new table and figure.

revision: yes

Referee: [Method] The formalization of the optimal replay design as a three-way trade-off is stated in the abstract and introduction but lacks an explicit objective or derivation showing how buffer size, replay ratio, and sampling policy are chosen to minimize the combined cost; this makes the 'well-designed' qualifier hard to reproduce or falsify.

Authors: We acknowledge that the three-way trade-off is currently described at a conceptual level. While the design choices were informed by systematic empirical sweeps and prior RL literature on experience replay, we agree that an explicit objective would improve reproducibility. In the revision we will add a dedicated subsection that defines a combined cost function C = α·Var(staleness) + β·(1/Diversity) + γ·Generation_cost and describes the empirical procedure (grid search over buffer size and replay ratio, with sampling policy selected to maximize diversity under the cost constraint) used to identify the operating points reported in the paper.

revision: yes
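To show what the grid search named in the simulated rebuttal might look like, here is a small sketch that evaluates a cost of the form C = α·Var(staleness) + β·(1/Diversity) + γ·Generation_cost over buffer sizes and replay ratios. The three proxy formulas are placeholders invented for this example and are not from the paper; in practice each term would be measured from training runs. `replay_ratio` is assumed here to mean the number of times each rollout is reused.

```python
from itertools import product


def combined_cost(buffer_size: int, replay_ratio: int,
                  alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    # Placeholder proxies (assumptions, not the paper's definitions):
    # more reuse from a larger buffer means older, higher-variance samples.
    staleness_var = replay_ratio * buffer_size / 1000.0
    # Larger buffers hold more distinct rollouts.
    diversity = 1.0 + buffer_size / 1000.0
    # Fresh rollouts needed per training sample shrink with reuse.
    generation_cost = 1.0 / replay_ratio
    return alpha * staleness_var + beta / diversity + gamma * generation_cost


buffer_sizes = [256, 1024, 4096]
replay_ratios = [1, 2, 4, 8]
grid = list(product(buffer_sizes, replay_ratios))

# Pick the (buffer_size, replay_ratio) pair with the lowest combined cost.
best = min(grid, key=lambda point: combined_cost(*point))
```

With these toy proxies the minimum sits at a replay ratio above 1, which mirrors the qualitative claim that some reuse beats strict on-policy sampling once generation has a price. The location of the optimum depends entirely on the placeholder formulas and the weights α, β, γ.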

Circularity Check

0 steps flagged

No circularity; empirical claims rest on experiments, not self-referential derivations

The paper conducts a systematic empirical study of replay buffers for LLM post-training, formalizing optimal design as a trade-off between staleness-induced variance, sample diversity, and generation cost, then showing via experiments that well-designed replay reduces inference compute without degrading performance. No equations, derivations, or first-principles results are presented that reduce any prediction to fitted parameters, self-citations, or inputs by construction. Central claims are supported by experimental observation rather than mathematical self-reference, and the work is self-contained against external benchmarks with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that generation cost dominates and that replay can be tuned without introducing unmanageable bias; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Fresh on-policy data is essential for high performance in LLM post-training.
This is the prevailing belief explicitly challenged by the work.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
cs.LG · 2026-05 · unverdicted · novelty 6.0

Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
