Detecting Spiky Corruption in Markov Decision Processes

Alok Singh; David Lindner; Jason Mancuso; Tomasz Kisielewski

arxiv: 1907.00452 · v1 · pith:2L2IAQMInew · submitted 2019-06-30 · 💻 cs.LG · stat.ML

Detecting Spiky Corruption in Markov Decision Processes

Jason Mancuso , Tomasz Kisielewski , David Lindner , Alok Singh This is my paper

Pith reviewed 2026-05-25 12:15 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords reinforcement learningcorrupt rewardsMarkov decision processesCRMDPregret boundsreward corruptionspiky corruptionstate detection

0 comments

The pith

If reward corruption in a CRMDP is sufficiently spiky, corrupt states can be detected so any standard RL algorithm learns the optimal policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning breaks when observed rewards differ from true rewards received. The paper studies this in Corrupt Reward Markov Decision Processes. It proves that spiky corruption, where mismatches concentrate in identifiable patterns, makes the setting solvable. A detection algorithm locates the corrupt states, after which any common RL method can recover the optimal policy. The paper also gives a complete characterization of the resulting regret bounds and tests the approach in gridworlds.

Core claim

In a Spiky CRMDP the reward corruption satisfies a structural spikiness condition that renders the environment solvable. The regret bound is fully characterized, and a detection algorithm identifies the corrupt states, after which any common reinforcement learning algorithm can be applied to learn the optimal policy.

What carries the argument

The spikiness condition on reward corruption together with a state-detection algorithm that isolates corrupt states in CRMDPs.

If this is right

Regret in Spiky CRMDPs is fully characterized and finite.
Corrupt states can be isolated by the detection algorithm.
Any standard RL algorithm applied after detection recovers the optimal policy.
The approach works in simple gridworld environments with introduced corruption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detection step could be inserted as a preprocessing stage before any existing RL codebase is run on noisy-reward data.
If real-world reward signals contain concentrated errors, the method suggests a way to sanitize them without redesigning the learner.
Extending the spikiness test to continuous state spaces would require checking whether the same concentration property still permits reliable detection.

Load-bearing premise

Reward corruption must be sufficiently spiky so that corrupt states become detectable and the regret bound remains tractable.

What would settle it

A CRMDP instance whose corruption violates the spikiness condition, in which the detection algorithm either misses corrupt states or the observed regret exceeds the claimed bound.

Figures

Figures reproduced from arXiv: 1907.00452 by Alok Singh, David Lindner, Jason Mancuso, Tomasz Kisielewski.

**Figure 1.** Figure 1: Toy Spiky CRMDP envorinments Corners on the left, OnTheWay on the right. The blue cell in the lower right hand corner is the starting position of the agent and the green cell in the upper left hand corner the goal. The true reward collected in each state is determined by the max-distance to the goal and shown here by the numbers in each cell. The reward in the red cells is corrupted and the underlying tru… view at source ↗

read the original abstract

Current reinforcement learning methods fail if the reward function is imperfect, i.e. if the agent observes reward different from what it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently "spiky", the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that our algorithm can detect the corrupt states and learn the optimal policy despite the corruption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a spiky subclass of corrupted-reward MDPs, gives a detection algorithm that flags bad states, and shows standard RL then recovers the policy on toy gridworlds.

read the letter

The core result is that when reward corruption is spiky enough, you can detect the affected states and hand the cleaned problem to any ordinary RL algorithm. They also supply a regret bound that holds under this restriction inside the CRMDP model. That is the new piece relative to earlier corrupted-reward work: a concrete detection step rather than just analysis of the corrupted setting. The gridworld runs confirm the detector finds the spikes and the policy comes out correct afterward. The experiments are small but they match the claim they are testing. The main limit is the spikiness assumption itself. If corruption is diffuse or gradual instead of concentrated spikes, the detection guarantee does not apply and the regret bound does not hold. The paper states this condition clearly, so there is no hidden circularity. The math and the algorithm rest on that structural restriction, and the experiments stay inside environments where the assumption is easy to satisfy. This is useful for researchers already looking at robust or safe RL who want a handle on one concrete corruption pattern. It is not a general solution for arbitrary reward tampering. A serious referee should see it because the detection method is explicit, the regret claim is stated, and the experiments are reproducible on the reported domains.

Referee Report

0 major / 2 minor

Summary. The paper studies Corrupt Reward Markov Decision Processes (CRMDPs) in which observed rewards differ from true rewards. It claims that when reward corruption is sufficiently 'spiky', the CRMDP is solvable; the regret bound is fully characterized; an algorithm exists to detect corrupt states; and the detection step allows any standard RL algorithm to recover the optimal policy. The claims are supported by experiments on two simple gridworld environments.

Significance. If the spikiness condition holds, the work supplies a concrete detection mechanism that restores solvability to an otherwise intractable corruption model and supplies an explicit regret characterization. This is a targeted but useful contribution to robust RL, as it isolates a structural restriction under which standard algorithms can be made to work without modification.

minor comments (2)

[Abstract] The abstract states that the regret bound of a Spiky CRMDP is 'fully characterized' but does not indicate whether the bound is parameter-free or depends on the spikiness parameters; a one-sentence clarification would help readers assess the strength of the theoretical result.
[Experiments] The experimental section refers to 'a pair of simple gridworld environments' without specifying state-space size, transition structure, or the precise locations and magnitudes of the spiky corruptions; adding these details would improve reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The assessment that the spikiness condition yields a concrete detection mechanism and explicit regret bound is consistent with the manuscript's contributions.

Circularity Check

0 steps flagged

No significant circularity; claims are conditional on explicit structural assumption

full rationale

The paper introduces the spikiness condition as an explicit restriction on reward corruption in CRMDPs and derives solvability, regret bounds, and a detection algorithm under that condition. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the central results to their own inputs by construction. The derivation is self-contained because the key claims are stated as holding precisely when the spikiness assumption is satisfied, with no load-bearing step that renames a fit or imports uniqueness via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the spikiness condition itself functions as an unelaborated domain assumption.

pith-pipeline@v0.9.0 · 5654 in / 1169 out tokens · 24825 ms · 2026-05-25T12:15:10.718834+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 4 internal anchors

[1]

Concrete Problems in AI Safety

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Man´ e. Concrete problems in AI safety. CoRR, abs/1606.06565,

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

[Barto and Dietterich, 2004 ] A. G. Barto and T.G. Diet- terich. Reinforcement learning and its relationship to su- pervised learning. John Wiley and Sons, Inc,

work page 2004
[3]

Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei

[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems ,

work page 2017
[4]

Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S

[Davis et al., 2007] Jason V . Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information- theoretic metric learning. In Proceedings of the 24th In- ternational Conference on Machine Learning , ICML ’07, pages 209–216, New Y ork, NY , USA,

work page 2007
[5]

[Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes

ACM. [Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. CoRR, abs/1803.11410,

work page arXiv 2018
[6]

Fairness through awareness

[Dwork et al., 2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Com- puter Science Conference,

work page 2012
[7]

Reinforcement learning with a corrupted reward channel

[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Lau- rent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In International Joint Confer- ence on Artiﬁcial Intelligence ,

work page 2017
[8]

Generalizing Skills with Semi-Supervised Reinforcement Learning

[Finn et al., 2016] Chelsea Finn, Tianhe Y u, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429,

work page internal anchor Pith review Pith/arXiv arXiv 2016
[9]

Safe exploration of state and action spaces in reinforcement learning

[Garc´ ıa and Fern´ andez, 2012] Javier Garc´ ıa and Fernando Fern´ andez. Safe exploration of state and action spaces in reinforcement learning. J. Artif. Int. Res. , 45(1):515–564, September

work page 2012
[10]

Roweis, 2005 ] Amir Globerson and Sam T

[Globerson and T. Roweis, 2005 ] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems , vol- ume 18, 01

work page 2005
[11]

Dragan, Pieter Abbeel, and Stuart J

[Hadﬁeld-Menell et al., 2016] Dylan Hadﬁeld-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. Cooperative inverse reinforcement learning. CoRR, abs/1606.03137,

work page arXiv 2016
[12]

Dhillon, and Kristen Grauman

[Jain et al., 2009] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In D. Koller, D. Schuurmans, Y . Bengio, and L. Bottou, editors, Advances in Neural In- formation Processing Systems 21 , pages 761–768. Curran Associates, Inc.,

work page 2009
[13]

Al- gorithms for inverse reinforcement learning

[Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Al- gorithms for inverse reinforcement learning. In Icml, vol- ume 1, page 2,

work page 2000
[14]

Riedl and Brent Harri- son

[Riedl and Harrison, 2016 ] Mark O. Riedl and Brent Harri- son. Using stories to teach human values to artiﬁcial agents. In AI, Ethics, and Society, Papers from the 2016 AAAI W orkshop, Phoenix, Arizona, USA, February 13, 2016.,

work page 2016
[15]

Deep Learning is Robust to Massive Label Noise

[Rolnick et al., 2017] David Rolnick, Andreas V eit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to mas- sive label noise. CoRR, abs/1705.10694,

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

Sutton, and Ashwin Ram

[Santamar´ ıaet al., 1997] Juan Carlos Santamar´ ıa, Richard S. Sutton, and Ashwin Ram. Experiments with reinforce- ment learning in problems with continuous state and action spaces. Adaptive Behaviour, 6:163–217,

work page 1997
[17]

Proximal Policy Optimization Algorithms

[Schulman et al., 2017] John Schulman, Filip Wolski, Pra- fulla Dhariwal, Alec Radford, and Oleg Klimov. Prox- imal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv 2017
[18]

Taylor, Brian Kulis, and Fei Sha

[Taylor et al., 2011] Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In AAMAS,

work page 2011
[19]

Safe exploration in ﬁnite markov decision processes with gaussian pro- cesses

[Turchetta et al., 2016] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in ﬁnite markov decision processes with gaussian pro- cesses. In Neural Information Processing Systems (NIPS) , pages 4305–4313, December

work page 2016

[1] [1]

Concrete Problems in AI Safety

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Man´ e. Concrete problems in AI safety. CoRR, abs/1606.06565,

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

[Barto and Dietterich, 2004 ] A. G. Barto and T.G. Diet- terich. Reinforcement learning and its relationship to su- pervised learning. John Wiley and Sons, Inc,

work page 2004

[3] [3]

Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei

[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems ,

work page 2017

[4] [4]

Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S

[Davis et al., 2007] Jason V . Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information- theoretic metric learning. In Proceedings of the 24th In- ternational Conference on Machine Learning , ICML ’07, pages 209–216, New Y ork, NY , USA,

work page 2007

[5] [5]

[Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes

ACM. [Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. CoRR, abs/1803.11410,

work page arXiv 2018

[6] [6]

Fairness through awareness

[Dwork et al., 2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Com- puter Science Conference,

work page 2012

[7] [7]

Reinforcement learning with a corrupted reward channel

[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Lau- rent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In International Joint Confer- ence on Artiﬁcial Intelligence ,

work page 2017

[8] [8]

Generalizing Skills with Semi-Supervised Reinforcement Learning

[Finn et al., 2016] Chelsea Finn, Tianhe Y u, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429,

work page internal anchor Pith review Pith/arXiv arXiv 2016

[9] [9]

Safe exploration of state and action spaces in reinforcement learning

[Garc´ ıa and Fern´ andez, 2012] Javier Garc´ ıa and Fernando Fern´ andez. Safe exploration of state and action spaces in reinforcement learning. J. Artif. Int. Res. , 45(1):515–564, September

work page 2012

[10] [10]

Roweis, 2005 ] Amir Globerson and Sam T

[Globerson and T. Roweis, 2005 ] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems , vol- ume 18, 01

work page 2005

[11] [11]

Dragan, Pieter Abbeel, and Stuart J

[Hadﬁeld-Menell et al., 2016] Dylan Hadﬁeld-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. Cooperative inverse reinforcement learning. CoRR, abs/1606.03137,

work page arXiv 2016

[12] [12]

Dhillon, and Kristen Grauman

[Jain et al., 2009] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In D. Koller, D. Schuurmans, Y . Bengio, and L. Bottou, editors, Advances in Neural In- formation Processing Systems 21 , pages 761–768. Curran Associates, Inc.,

work page 2009

[13] [13]

Al- gorithms for inverse reinforcement learning

[Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Al- gorithms for inverse reinforcement learning. In Icml, vol- ume 1, page 2,

work page 2000

[14] [14]

Riedl and Brent Harri- son

[Riedl and Harrison, 2016 ] Mark O. Riedl and Brent Harri- son. Using stories to teach human values to artiﬁcial agents. In AI, Ethics, and Society, Papers from the 2016 AAAI W orkshop, Phoenix, Arizona, USA, February 13, 2016.,

work page 2016

[15] [15]

Deep Learning is Robust to Massive Label Noise

[Rolnick et al., 2017] David Rolnick, Andreas V eit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to mas- sive label noise. CoRR, abs/1705.10694,

work page internal anchor Pith review Pith/arXiv arXiv 2017

[16] [16]

Sutton, and Ashwin Ram

[Santamar´ ıaet al., 1997] Juan Carlos Santamar´ ıa, Richard S. Sutton, and Ashwin Ram. Experiments with reinforce- ment learning in problems with continuous state and action spaces. Adaptive Behaviour, 6:163–217,

work page 1997

[17] [17]

Proximal Policy Optimization Algorithms

[Schulman et al., 2017] John Schulman, Filip Wolski, Pra- fulla Dhariwal, Alec Radford, and Oleg Klimov. Prox- imal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv 2017

[18] [18]

Taylor, Brian Kulis, and Fei Sha

[Taylor et al., 2011] Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In AAMAS,

work page 2011

[19] [19]

Safe exploration in ﬁnite markov decision processes with gaussian pro- cesses

[Turchetta et al., 2016] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in ﬁnite markov decision processes with gaussian pro- cesses. In Neural Information Processing Systems (NIPS) , pages 4305–4313, December

work page 2016