Detecting Spiky Corruption in Markov Decision Processes
Pith reviewed 2026-05-25 12:15 UTC · model grok-4.3
The pith
If reward corruption in a CRMDP is sufficiently spiky, corrupt states can be detected so any standard RL algorithm learns the optimal policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a Spiky CRMDP the reward corruption satisfies a structural spikiness condition that renders the environment solvable. The regret bound is fully characterized, and a detection algorithm identifies the corrupt states, after which any common reinforcement learning algorithm can be applied to learn the optimal policy.
What carries the argument
The spikiness condition on reward corruption together with a state-detection algorithm that isolates corrupt states in CRMDPs.
If this is right
- Regret in Spiky CRMDPs is fully characterized and finite.
- Corrupt states can be isolated by the detection algorithm.
- Any standard RL algorithm applied after detection recovers the optimal policy.
- The approach works in simple gridworld environments with introduced corruption.
Where Pith is reading between the lines
- The same detection step could be inserted as a preprocessing stage before any existing RL codebase is run on noisy-reward data.
- If real-world reward signals contain concentrated errors, the method suggests a way to sanitize them without redesigning the learner.
- Extending the spikiness test to continuous state spaces would require checking whether the same concentration property still permits reliable detection.
Load-bearing premise
Reward corruption must be sufficiently spiky so that corrupt states become detectable and the regret bound remains tractable.
What would settle it
A CRMDP instance whose corruption violates the spikiness condition, in which the detection algorithm either misses corrupt states or the observed regret exceeds the claimed bound.
Figures
read the original abstract
Current reinforcement learning methods fail if the reward function is imperfect, i.e. if the agent observes reward different from what it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently "spiky", the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that our algorithm can detect the corrupt states and learn the optimal policy despite the corruption.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies Corrupt Reward Markov Decision Processes (CRMDPs) in which observed rewards differ from true rewards. It claims that when reward corruption is sufficiently 'spiky', the CRMDP is solvable; the regret bound is fully characterized; an algorithm exists to detect corrupt states; and the detection step allows any standard RL algorithm to recover the optimal policy. The claims are supported by experiments on two simple gridworld environments.
Significance. If the spikiness condition holds, the work supplies a concrete detection mechanism that restores solvability to an otherwise intractable corruption model and supplies an explicit regret characterization. This is a targeted but useful contribution to robust RL, as it isolates a structural restriction under which standard algorithms can be made to work without modification.
minor comments (2)
- [Abstract] The abstract states that the regret bound of a Spiky CRMDP is 'fully characterized' but does not indicate whether the bound is parameter-free or depends on the spikiness parameters; a one-sentence clarification would help readers assess the strength of the theoretical result.
- [Experiments] The experimental section refers to 'a pair of simple gridworld environments' without specifying state-space size, transition structure, or the precise locations and magnitudes of the spiky corruptions; adding these details would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive summary and recommendation of minor revision. The assessment that the spikiness condition yields a concrete detection mechanism and explicit regret bound is consistent with the manuscript's contributions.
Circularity Check
No significant circularity; claims are conditional on explicit structural assumption
full rationale
The paper introduces the spikiness condition as an explicit restriction on reward corruption in CRMDPs and derives solvability, regret bounds, and a detection algorithm under that condition. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the central results to their own inputs by construction. The derivation is self-contained because the key claims are stated as holding precisely when the spikiness assumption is satisfied, with no load-bearing step that renames a fit or imports uniqueness via prior self-work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Concrete Problems in AI Safety
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Man´ e. Concrete problems in AI safety. CoRR, abs/1606.06565,
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
[Barto and Dietterich, 2004 ] A. G. Barto and T.G. Diet- terich. Reinforcement learning and its relationship to su- pervised learning. John Wiley and Sons, Inc,
work page 2004
-
[3]
Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei
[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems ,
work page 2017
-
[4]
Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S
[Davis et al., 2007] Jason V . Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information- theoretic metric learning. In Proceedings of the 24th In- ternational Conference on Machine Learning , ICML ’07, pages 209–216, New Y ork, NY , USA,
work page 2007
-
[5]
[Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes
ACM. [Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. CoRR, abs/1803.11410,
-
[6]
[Dwork et al., 2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Com- puter Science Conference,
work page 2012
-
[7]
Reinforcement learning with a corrupted reward channel
[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Lau- rent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In International Joint Confer- ence on Artificial Intelligence ,
work page 2017
-
[8]
Generalizing Skills with Semi-Supervised Reinforcement Learning
[Finn et al., 2016] Chelsea Finn, Tianhe Y u, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429,
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[9]
Safe exploration of state and action spaces in reinforcement learning
[Garc´ ıa and Fern´ andez, 2012] Javier Garc´ ıa and Fernando Fern´ andez. Safe exploration of state and action spaces in reinforcement learning. J. Artif. Int. Res. , 45(1):515–564, September
work page 2012
-
[10]
Roweis, 2005 ] Amir Globerson and Sam T
[Globerson and T. Roweis, 2005 ] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems , vol- ume 18, 01
work page 2005
-
[11]
Dragan, Pieter Abbeel, and Stuart J
[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. Cooperative inverse reinforcement learning. CoRR, abs/1606.03137,
-
[12]
[Jain et al., 2009] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In D. Koller, D. Schuurmans, Y . Bengio, and L. Bottou, editors, Advances in Neural In- formation Processing Systems 21 , pages 761–768. Curran Associates, Inc.,
work page 2009
-
[13]
Al- gorithms for inverse reinforcement learning
[Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Al- gorithms for inverse reinforcement learning. In Icml, vol- ume 1, page 2,
work page 2000
-
[14]
[Riedl and Harrison, 2016 ] Mark O. Riedl and Brent Harri- son. Using stories to teach human values to artificial agents. In AI, Ethics, and Society, Papers from the 2016 AAAI W orkshop, Phoenix, Arizona, USA, February 13, 2016.,
work page 2016
-
[15]
Deep Learning is Robust to Massive Label Noise
[Rolnick et al., 2017] David Rolnick, Andreas V eit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to mas- sive label noise. CoRR, abs/1705.10694,
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[16]
[Santamar´ ıaet al., 1997] Juan Carlos Santamar´ ıa, Richard S. Sutton, and Ashwin Ram. Experiments with reinforce- ment learning in problems with continuous state and action spaces. Adaptive Behaviour, 6:163–217,
work page 1997
-
[17]
Proximal Policy Optimization Algorithms
[Schulman et al., 2017] John Schulman, Filip Wolski, Pra- fulla Dhariwal, Alec Radford, and Oleg Klimov. Prox- imal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[18]
Taylor, Brian Kulis, and Fei Sha
[Taylor et al., 2011] Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In AAMAS,
work page 2011
-
[19]
Safe exploration in finite markov decision processes with gaussian pro- cesses
[Turchetta et al., 2016] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite markov decision processes with gaussian pro- cesses. In Neural Information Processing Systems (NIPS) , pages 4305–4313, December
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.