
arxiv: 2605.00298 · v1 · submitted 2026-04-30 · 💻 cs.LG · math.OC

Data Deletion Can Help in Adaptive RL

Pith reviewed 2026-05-07 04:43 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords deletion · context · distribution · data · deployment · estimator · training · environments

The pith

Randomly deleting a fraction of the training buffer after each round in contextual RL creates an implicit exponential decay on stale samples from mismatched distributions, cutting the context estimator's robustness gap by 30% for MLPs and letting a narrow MLP with 5x fewer parameters outperform a wide MLP trained without deletion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning agents often struggle when the world around them changes, as when a robot moves from one room to another with different conditions. This paper studies the problem using contextual Markov Decision Processes (cMDPs), where each environment in a family is indexed by a hidden, low-dimensional context that affects how the agent should act. The standard approach is to train one "universal policy" that is given the true context, then train a separate estimator to infer the context from the observed trajectory.

The surprising finding is that, when data is collected in rounds, randomly discarding a fraction of the training buffer after each round helps the estimator at deployment. The agent's policy improves across rounds, so early data comes from a worse policy and does not resemble what the estimator will see once the policy is good. Random deletion gives newer data more influence without having to identify which samples are stale. In the reported experiments, this cut the robustness gap by 30 percent for MLPs and by 6 percent on average for recurrent networks. It also let a narrow MLP with five times fewer parameters outperform a wide MLP trained without deletion.

To explain why, the authors analyze a simpler problem: regularized empirical risk minimization when the training distribution does not match the deployment distribution. They prove that removing a single uniformly random training point lowers expected test loss under mild conditions. For ridge regression they make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and the SNR threshold measures how large the mismatch must be for deletion to help.

Core claim

Randomly deleting a fraction of the training buffer after each round substantially improves the context estimator. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity, without requiring any explicit identification of which samples are stale. This reduces the robustness gap by 30% for MLPs and by 6% on average for recurrent networks.
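The mechanism can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`delete_random_fraction`, `run_rounds`), the tuple stand-ins for trajectories, and the choice to delete right after each round's data is appended are all assumptions. It counts how many samples from each round survive, so the decay on older rounds is visible without any sample being labeled as stale.

```python
import random
from collections import Counter


def delete_random_fraction(buffer, fraction, rng):
    """Return a uniformly random subset of `buffer` with about `fraction` of it dropped."""
    n_drop = int(fraction * len(buffer))
    return rng.sample(buffer, len(buffer) - n_drop)


def run_rounds(num_rounds, samples_per_round, fraction, seed=0):
    """Simulate round-based collection with random deletion after every round.

    Each sample is a (round_index, sample_index) tuple, a hypothetical
    stand-in for a trajectory collected with that round's policy.
    Returns a Counter mapping round_index to the number of surviving samples.
    """
    rng = random.Random(seed)
    buffer = []
    for round_idx in range(num_rounds):
        # Assumption: new data is appended first, then the whole buffer is thinned.
        buffer.extend((round_idx, i) for i in range(samples_per_round))
        buffer = delete_random_fraction(buffer, fraction, rng)
    return Counter(round_idx for round_idx, _ in buffer)


if __name__ == "__main__":
    survivors = run_rounds(num_rounds=5, samples_per_round=1000, fraction=0.5)
    for round_idx in sorted(survivors):
        print(round_idx, survivors[round_idx])
```

With a deletion fraction of 0.5, the expected number of survivors from a round halves with each additional round of age, even though every deletion step treats all samples in the buffer identically.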

Load-bearing premise

That older trajectories systematically come from a meaningfully different distribution due to policy improvement, and that random deletion will create a beneficial implicit decay without excessive loss of useful information, as required for the idealized regularized ERM analysis to predict real gains.

Original abstract

Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called "universal policy" which assumes knowledge of the true context, then pair it with a context estimator which approximates context using the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity without requiring any explicit identification of which samples are stale. This reduces robustness gap by 30% for MLPs and by 6% on average for recurrent networks. Strikingly, it allows a narrow MLP with 5x fewer parameters to outperform a wide MLP trained without deletion. To understand when and why deletion helps, we analyze regularized empirical risk minimization with a mismatch between the train distribution and the distribution at deployment; in this idealized setting, we prove that removing a single uniformly random training point decreases expected test loss in expectation under mild conditions. For ridge regression we make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and, crucially, this SNR threshold gives a direct measure of how large the distribution mismatch between training and deployment must be for deletion to be beneficial.
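The idealized ridge setting can be explored numerically. The sketch below is an assumption-laden toy, not the paper's analysis: it uses one-dimensional ridge regression, models the train/deployment mismatch as a different true slope (`w_train` versus `w_deploy`), and all function and parameter names are hypothetical. It estimates the average excess deployment risk of the full fit and of the fit with one uniformly random training point removed.

```python
import random


def ridge_fit_1d(xs, ys, lam):
    """Closed-form 1-D ridge: argmin_w sum_i (y_i - w * x_i)^2 + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def compare_deletion(n, lam, w_train, w_deploy, noise_std, trials, seed=0):
    """Monte Carlo estimate of excess deployment risk, with and without deletion.

    Training labels follow slope `w_train`; deployment follows `w_deploy`.
    With unit-variance inputs at deployment, the excess risk of a fitted
    slope w is (w - w_deploy)^2.
    Returns (mean risk of full fit, mean risk after deleting one random point).
    """
    rng = random.Random(seed)
    full_risk = 0.0
    deleted_risk = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [w_train * x + rng.gauss(0.0, noise_std) for x in xs]
        w_full = ridge_fit_1d(xs, ys, lam)
        j = rng.randrange(n)  # index of the uniformly random point to remove
        w_del = ridge_fit_1d(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:], lam)
        full_risk += (w_full - w_deploy) ** 2
        deleted_risk += (w_del - w_deploy) ** 2
    return full_risk / trials, deleted_risk / trials


if __name__ == "__main__":
    print(compare_deletion(n=20, lam=1.0, w_train=1.0, w_deploy=0.5,
                           noise_std=1.0, trials=2000))
```

Which of the two numbers is smaller depends on `lam`, `noise_std`, and the size of the mismatch; the sketch makes no claim about the outcome and is only a way to probe the regime the abstract describes.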

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach depends on the domain assumption that policy improvement across rounds produces a systematic distribution shift between training data and deployment, plus standard ridge-regression assumptions of moderate regularization and low SNR for the quantitative benefit to hold.

free parameters (1)
  • deletion fraction
    The proportion of the buffer randomly removed after each round; chosen empirically to trade off decay against diversity preservation.
axioms (2)
  • domain assumption: Data collected across rounds with progressively better policies yields older trajectories whose distribution differs from the deployment distribution faced by the estimator.
    Invoked to explain why random deletion creates a useful implicit exponential decay on stale samples.
  • standard math: Removing a single uniformly random training point decreases expected test loss in regularized ERM under mild conditions; for ridge regression, these are moderate regularization and sufficiently low SNR.
    The idealized mathematical result that underpins the claim that deletion helps when distribution mismatch is large enough.
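The trade-off governed by the deletion fraction can be made concrete with a back-of-the-envelope calculation. The notation is assumed for illustration and is not taken from the paper: $n$ samples are collected per round, a fraction $p$ of the buffer is deleted after each round's data is appended, and each sample survives a deletion step with probability $1-p$.

```latex
\begin{align}
% N_{k,R}: number of samples from round k still in the buffer after round R (k <= R)
\mathbb{E}[N_{k,R}] &= n\,(1-p)^{R-k+1} \\
% Expected total buffer size after R rounds, summing over ages a = R-k+1
\mathbb{E}\Big[\sum_{k=1}^{R} N_{k,R}\Big]
  &= n \sum_{a=1}^{R} (1-p)^{a}
   = n\,\frac{(1-p)\bigl(1-(1-p)^{R}\bigr)}{p}
   \;\xrightarrow{R\to\infty}\; n\,\frac{1-p}{p}
\end{align}
```

Under these assumptions, a larger $p$ makes old rounds fade faster but also shrinks the steady-state buffer, which is one way to read the decay-versus-diversity trade-off attached to the deletion fraction.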

pith-pipeline@v0.9.0 · 5606 in / 1766 out tokens · 97756 ms · 2026-05-07T04:43:23.087429+00:00 · methodology
