pith. sign in

arxiv: 1907.00452 · v1 · pith:2L2IAQMInew · submitted 2019-06-30 · 💻 cs.LG · stat.ML

Detecting Spiky Corruption in Markov Decision Processes

Pith reviewed 2026-05-25 12:15 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords reinforcement learningcorrupt rewardsMarkov decision processesCRMDPregret boundsreward corruptionspiky corruptionstate detection
0
0 comments X

The pith

If reward corruption in a CRMDP is sufficiently spiky, corrupt states can be detected so any standard RL algorithm learns the optimal policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning breaks when observed rewards differ from true rewards received. The paper studies this in Corrupt Reward Markov Decision Processes. It proves that spiky corruption, where mismatches concentrate in identifiable patterns, makes the setting solvable. A detection algorithm locates the corrupt states, after which any common RL method can recover the optimal policy. The paper also gives a complete characterization of the resulting regret bounds and tests the approach in gridworlds.

Core claim

In a Spiky CRMDP the reward corruption satisfies a structural spikiness condition that renders the environment solvable. The regret bound is fully characterized, and a detection algorithm identifies the corrupt states, after which any common reinforcement learning algorithm can be applied to learn the optimal policy.

What carries the argument

The spikiness condition on reward corruption together with a state-detection algorithm that isolates corrupt states in CRMDPs.

If this is right

  • Regret in Spiky CRMDPs is fully characterized and finite.
  • Corrupt states can be isolated by the detection algorithm.
  • Any standard RL algorithm applied after detection recovers the optimal policy.
  • The approach works in simple gridworld environments with introduced corruption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same detection step could be inserted as a preprocessing stage before any existing RL codebase is run on noisy-reward data.
  • If real-world reward signals contain concentrated errors, the method suggests a way to sanitize them without redesigning the learner.
  • Extending the spikiness test to continuous state spaces would require checking whether the same concentration property still permits reliable detection.

Load-bearing premise

Reward corruption must be sufficiently spiky so that corrupt states become detectable and the regret bound remains tractable.

What would settle it

A CRMDP instance whose corruption violates the spikiness condition, in which the detection algorithm either misses corrupt states or the observed regret exceeds the claimed bound.

Figures

Figures reproduced from arXiv: 1907.00452 by Alok Singh, David Lindner, Jason Mancuso, Tomasz Kisielewski.

Figure 1
Figure 1. Figure 1: Toy Spiky CRMDP envorinments Corners on the left, On￾TheWay on the right. The blue cell in the lower right hand corner is the starting position of the agent and the green cell in the upper left hand corner the goal. The true reward collected in each state is determined by the max-distance to the goal and shown here by the numbers in each cell. The reward in the red cells is corrupted and the underlying tru… view at source ↗
read the original abstract

Current reinforcement learning methods fail if the reward function is imperfect, i.e. if the agent observes reward different from what it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently "spiky", the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that our algorithm can detect the corrupt states and learn the optimal policy despite the corruption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper studies Corrupt Reward Markov Decision Processes (CRMDPs) in which observed rewards differ from true rewards. It claims that when reward corruption is sufficiently 'spiky', the CRMDP is solvable; the regret bound is fully characterized; an algorithm exists to detect corrupt states; and the detection step allows any standard RL algorithm to recover the optimal policy. The claims are supported by experiments on two simple gridworld environments.

Significance. If the spikiness condition holds, the work supplies a concrete detection mechanism that restores solvability to an otherwise intractable corruption model and supplies an explicit regret characterization. This is a targeted but useful contribution to robust RL, as it isolates a structural restriction under which standard algorithms can be made to work without modification.

minor comments (2)
  1. [Abstract] The abstract states that the regret bound of a Spiky CRMDP is 'fully characterized' but does not indicate whether the bound is parameter-free or depends on the spikiness parameters; a one-sentence clarification would help readers assess the strength of the theoretical result.
  2. [Experiments] The experimental section refers to 'a pair of simple gridworld environments' without specifying state-space size, transition structure, or the precise locations and magnitudes of the spiky corruptions; adding these details would improve reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The assessment that the spikiness condition yields a concrete detection mechanism and explicit regret bound is consistent with the manuscript's contributions.

Circularity Check

0 steps flagged

No significant circularity; claims are conditional on explicit structural assumption

full rationale

The paper introduces the spikiness condition as an explicit restriction on reward corruption in CRMDPs and derives solvability, regret bounds, and a detection algorithm under that condition. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the central results to their own inputs by construction. The derivation is self-contained because the key claims are stated as holding precisely when the spikiness assumption is satisfied, with no load-bearing step that renames a fit or imports uniqueness via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the spikiness condition itself functions as an unelaborated domain assumption.

pith-pipeline@v0.9.0 · 5654 in / 1169 out tokens · 24825 ms · 2026-05-25T12:15:10.718834+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 4 internal anchors

  1. [1]

    Concrete Problems in AI Safety

    [Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Man´ e. Concrete problems in AI safety. CoRR, abs/1606.06565,

  2. [2]

    [Barto and Dietterich, 2004 ] A. G. Barto and T.G. Diet- terich. Reinforcement learning and its relationship to su- pervised learning. John Wiley and Sons, Inc,

  3. [3]

    Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei

    [Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems ,

  4. [4]

    Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S

    [Davis et al., 2007] Jason V . Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information- theoretic metric learning. In Proceedings of the 24th In- ternational Conference on Machine Learning , ICML ’07, pages 209–216, New Y ork, NY , USA,

  5. [5]

    [Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes

    ACM. [Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. CoRR, abs/1803.11410,

  6. [6]

    Fairness through awareness

    [Dwork et al., 2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Com- puter Science Conference,

  7. [7]

    Reinforcement learning with a corrupted reward channel

    [Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Lau- rent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In International Joint Confer- ence on Artificial Intelligence ,

  8. [8]

    Generalizing Skills with Semi-Supervised Reinforcement Learning

    [Finn et al., 2016] Chelsea Finn, Tianhe Y u, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429,

  9. [9]

    Safe exploration of state and action spaces in reinforcement learning

    [Garc´ ıa and Fern´ andez, 2012] Javier Garc´ ıa and Fernando Fern´ andez. Safe exploration of state and action spaces in reinforcement learning. J. Artif. Int. Res. , 45(1):515–564, September

  10. [10]

    Roweis, 2005 ] Amir Globerson and Sam T

    [Globerson and T. Roweis, 2005 ] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems , vol- ume 18, 01

  11. [11]

    Dragan, Pieter Abbeel, and Stuart J

    [Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. Cooperative inverse reinforcement learning. CoRR, abs/1606.03137,

  12. [12]

    Dhillon, and Kristen Grauman

    [Jain et al., 2009] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In D. Koller, D. Schuurmans, Y . Bengio, and L. Bottou, editors, Advances in Neural In- formation Processing Systems 21 , pages 761–768. Curran Associates, Inc.,

  13. [13]

    Al- gorithms for inverse reinforcement learning

    [Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Al- gorithms for inverse reinforcement learning. In Icml, vol- ume 1, page 2,

  14. [14]

    Riedl and Brent Harri- son

    [Riedl and Harrison, 2016 ] Mark O. Riedl and Brent Harri- son. Using stories to teach human values to artificial agents. In AI, Ethics, and Society, Papers from the 2016 AAAI W orkshop, Phoenix, Arizona, USA, February 13, 2016.,

  15. [15]

    Deep Learning is Robust to Massive Label Noise

    [Rolnick et al., 2017] David Rolnick, Andreas V eit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to mas- sive label noise. CoRR, abs/1705.10694,

  16. [16]

    Sutton, and Ashwin Ram

    [Santamar´ ıaet al., 1997] Juan Carlos Santamar´ ıa, Richard S. Sutton, and Ashwin Ram. Experiments with reinforce- ment learning in problems with continuous state and action spaces. Adaptive Behaviour, 6:163–217,

  17. [17]

    Proximal Policy Optimization Algorithms

    [Schulman et al., 2017] John Schulman, Filip Wolski, Pra- fulla Dhariwal, Alec Radford, and Oleg Klimov. Prox- imal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  18. [18]

    Taylor, Brian Kulis, and Fei Sha

    [Taylor et al., 2011] Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In AAMAS,

  19. [19]

    Safe exploration in finite markov decision processes with gaussian pro- cesses

    [Turchetta et al., 2016] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite markov decision processes with gaussian pro- cesses. In Neural Information Processing Systems (NIPS) , pages 4305–4313, December