Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:03 UTC · model grok-4.3
The pith
Monte Carlo planning at test time detects and neutralizes backdoor triggers in reinforcement learning policies with only black-box access.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Plan2Cleanse adapts Monte Carlo Tree Search to identify temporally extended trigger sequences that activate backdoors in RL policies and uses the detection outcomes for tree-search preventive replanning, all while requiring only black-box access to the target policy.
What carries the argument
Monte Carlo Tree Search recast as a planning algorithm that explores possible trigger sequences in the RL policy's state-action space and enables mitigation by replanning around discovered triggers.
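The recast can be sketched in miniature. The toy below is an illustrative reconstruction, not the paper's implementation: a UCB-guided tree search over short adversarial action sequences that queries a black-box return oracle and scores each sequence by the return degradation it induces. The action set, the hidden trigger, and the `rollout_return` oracle are all assumptions of this sketch.

```python
import math
import random

ACTIONS = (0, 1, 2)
TRIGGER = (2, 2)   # hidden trigger prefix; a toy assumption, not the paper's
HORIZON = 3        # length of candidate adversarial sequences

def rollout_return(seq):
    """Black-box query: episode return of the Trojan policy when the
    adversary injects the action sequence `seq`. In this toy model the
    return collapses once the trigger prefix appears in the sequence."""
    for i in range(len(seq) - len(TRIGGER) + 1):
        if tuple(seq[i:i + len(TRIGGER)]) == TRIGGER:
            return 0.0
    return 1.0

def mcts_trigger_search(iterations=300, c=1.4, seed=0):
    """UCB-guided tree search for the sequence that most degrades the
    policy's return (score = nominal return minus observed return)."""
    rng = random.Random(seed)
    visits, value = {(): 0}, {(): 0.0}
    best_seq, best_score = None, -1.0
    for _ in range(iterations):
        seq = ()
        # Selection: descend by UCB; expand the first unseen child.
        while len(seq) < HORIZON:
            children = [seq + (a,) for a in ACTIONS]
            unseen = [ch for ch in children if ch not in visits]
            if unseen:
                seq = rng.choice(unseen)
                visits[seq], value[seq] = 0, 0.0
                break
            parent_n = visits[seq]
            seq = max(children, key=lambda ch: value[ch] / visits[ch]
                      + c * math.sqrt(math.log(parent_n) / visits[ch]))
        # Rollout: complete the sequence with random actions.
        while len(seq) < HORIZON:
            seq = seq + (rng.choice(ACTIONS),)
        score = 1.0 - rollout_return(seq)
        if score > best_score:
            best_seq, best_score = seq, score
        # Backpropagate the degradation score along stored prefixes.
        for d in range(len(seq) + 1):
            if seq[:d] in visits:
                visits[seq[:d]] += 1
                value[seq[:d]] += score
    return best_seq, best_score
```

Even in this tiny search space the point carries: the search needs only return queries, never policy weights, which is the black-box property the core claim turns on.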
If this is right
- Trigger detection success rates increase by more than 61.4 percentage points in stealthy O-RAN scenarios.
- Win rates rise from 35% to 53% in competitive Humanoid environments by neutralizing triggers through replanning.
- Backdoor mitigation occurs at test time without requiring model retraining or white-box access to the policy.
- The same planning framework applies across MuJoCo locomotion tasks, simulated wireless networks, and Atari games.
Where Pith is reading between the lines
- If backdoors rely on sequence-based triggers, comparable search methods might defend other black-box sequential decision systems such as planning agents in robotics.
- Runtime monitors built on tree search could serve as a general layer for securing deployed RL policies against unknown attacks.
- The approach might be extended by combining it with lightweight online adaptation to handle triggers that evolve over time.
Load-bearing premise
Backdoor triggers appear as temporally extended sequences of actions or states that can be efficiently discovered and neutralized through Monte Carlo planning while maintaining only black-box access to the target policy.
What would settle it
An experiment in which no finite action sequence activates the backdoor, or in which the search consistently fails to locate a trigger known to be present, would falsify the load-bearing premise and show the planning method cannot serve as a defense.
Original abstract
Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35\% to 53\% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Plan2Cleanse, a test-time backdoor defense for RL policies that recasts trigger detection as a Monte Carlo Tree Search planning problem over temporally extended sequences. It maintains only black-box access to the target policy, uses the planning output for mitigation via preventive replanning, and reports empirical results across MuJoCo competitive environments, simulated O-RAN networks, and Atari games, including detection-rate gains exceeding 61.4 percentage points in stealthy O-RAN cases and win-rate improvement from 35% to 53% in Humanoid.
Significance. If the claims are substantiated, the work contributes a practical test-time defense that avoids retraining and white-box assumptions, addressing a growing concern for third-party-trained RL models in deployed systems. The public code release supports reproducibility.
major comments (2)
- [§3] §3 (Method): The MCTS procedure requires a concrete evaluation function or scoring rule to identify backdoor-activating trajectories from black-box policy queries alone. No explicit definition, pseudocode, or implementation detail is supplied for this rule (e.g., reward deviation, state anomaly detector, or domain-specific heuristic), yet the reported 61.4 pp detection lift and 35 % → 53 % win-rate gain both presuppose that the rule reliably guides search to the correct trigger rather than high-variance or low-reward paths.
- [§4] §4 (Experiments): The O-RAN and Humanoid results claim large absolute gains without reporting the number of independent trials, statistical significance tests, error bars, or the precise baselines and trigger-construction protocols used. This absence prevents verification that the improvements are robust rather than artifacts of particular environment configurations or evaluation choices.
minor comments (2)
- The abstract and introduction use the term 'stealthy' triggers without a precise definition or reference to how stealth is quantified (e.g., trigger length, activation probability, or detectability by standard anomaly detectors).
- [§3] Notation for the planning components (e.g., the tree policy, rollout policy, and backdoor-specific value function) should be introduced with a single consistent table or diagram early in §3 to aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments in turn below and indicate the revisions made to the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (Method): The MCTS procedure requires a concrete evaluation function or scoring rule to identify backdoor-activating trajectories from black-box policy queries alone. No explicit definition, pseudocode, or implementation detail is supplied for this rule (e.g., reward deviation, state anomaly detector, or domain-specific heuristic), yet the reported 61.4 pp detection lift and 35 % → 53 % win-rate gain both presuppose that the rule reliably guides search to the correct trigger rather than high-variance or low-reward paths.
Authors: We appreciate the referee pointing out the need for greater clarity on the MCTS evaluation function. Our approach employs a scoring rule that estimates the probability of backdoor activation by measuring deviations in observed rewards and action distributions from the policy's nominal behavior, using only black-box queries. We have now included an explicit mathematical definition of this scoring rule, along with pseudocode for the complete MCTS procedure, in the revised version of §3. This addition ensures that readers can understand how the search is guided toward trigger-activating trajectories. revision: yes
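The rebuttal's description of the scoring rule can be read as combining two deviation terms: a drop in observed return and a shift in the action distribution. A minimal sketch under that reading, assuming a simple weighted sum (the paper's exact combination, weights, and divergence choice are not specified here):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def activation_score(obs_returns, ref_returns, obs_action_dist, ref_action_dist,
                     w_reward=1.0, w_action=1.0):
    """Hypothetical black-box scoring rule: combine the drop in mean return
    relative to nominal behavior with the KL shift of the action distribution.
    Higher scores suggest the candidate sequence activated the backdoor."""
    mean = lambda xs: sum(xs) / len(xs)
    reward_drop = max(0.0, mean(ref_returns) - mean(obs_returns))
    action_shift = kl_divergence(obs_action_dist, ref_action_dist)
    return w_reward * reward_drop + w_action * action_shift
```

A benign rollout (returns near nominal, unchanged action distribution) scores near zero, while a triggered rollout (collapsed returns, skewed actions) scores high, which is the property the search needs to be guided by.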
-
Referee: [§4] §4 (Experiments): The O-RAN and Humanoid results claim large absolute gains without reporting the number of independent trials, statistical significance tests, error bars, or the precise baselines and trigger-construction protocols used. This absence prevents verification that the improvements are robust rather than artifacts of particular environment configurations or evaluation choices.
Authors: We agree that the experimental section would benefit from additional details to support the robustness of the results. In the revised manuscript, we have added information on the number of independent trials conducted (specifically, 10 runs for each reported result), the statistical significance tests performed (including p-values from t-tests), error bars in the figures, and precise descriptions of the baseline algorithms and trigger construction methods for the O-RAN and Humanoid environments. These updates appear in §4 and the associated supplementary material. revision: yes
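For concreteness, the statistics the authors promise (10 runs, t-tests, error bars) need nothing beyond the standard library. A sketch with made-up per-run win rates, using Welch's t statistic (whether the paper uses Student's or Welch's variant is not stated):

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for
    two independent samples, e.g. per-seed win rates of method vs. baseline."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def error_bar(xs):
    """Mean and standard error of the mean, as would back a figure error bar."""
    return mean(xs), stdev(xs) / math.sqrt(len(xs))
```

With toy samples centered on the reported 53% vs. 35% win rates and small per-seed spread, the t statistic is large, which is the kind of evidence the referee is asking to see reported.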
Circularity Check
No circularity; method is an independent planning procedure
Full rationale
The paper introduces Plan2Cleanse as a novel test-time framework that recasts backdoor detection in RL as a Monte Carlo planning problem over trigger sequences, using only black-box policy queries. No equations, fitted parameters, or self-citations are presented that reduce the central construction to its own inputs by definition. The reported performance gains (e.g., 61.4 pp detection lift) are framed as empirical outcomes of applying the planning procedure, not as quantities forced by construction from prior fits or renamed known results. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Backdoor triggers manifest as temporally extended sequences that can be systematically explored via planning under black-box policy access
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"We unify backdoor detection across different attacks by recasting it as a planning problem... employ MCTS... guided by a scalar reward signal that reflects the degradation of the performance of the Trojan policy... Q(a^adv_{1:T}) := E[∑_{t=1}^{T} γ^{t-1} r^(−)_t]"
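The quoted objective is a discounted expected return over negated rewards. A minimal Monte-Carlo estimator of that quantity, with function and variable names chosen here for illustration:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum  sum_t gamma^(t-1) r_t  of one replayed rollout
    (enumerate starts at t-1 = 0, matching the gamma^(t-1) exponent)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def q_estimate(rollouts, gamma=0.99):
    """Monte-Carlo estimate of Q(a^adv_{1:T}) = E[sum_t gamma^(t-1) r^(-)_t]:
    average the discounted negated per-step rewards over several replays."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)
```

Since r^(−) is the negated reward, a higher Q estimate means greater degradation of the Trojan policy, so the scalar can directly guide the tree search.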
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Voronoi-guided sampling... interactionDefect... branch selection
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Universal Trojan signatures in reinforcement learning
Manoj Acharya, Weichao Zhou, Anirban Roy, Xiao Lin, Wenchao Li, and Susmit Jha. Universal Trojan signatures in reinforcement learning. In NeurIPS 2023 Workshop on Backdoors in Deep Learning: The Good, the Bad, and the Ugly, 2023.
work page 2023
-
[2]
Published in Transactions on Machine Learning Research (05/2026)
A method for evaluating hyperparameter sensitivity in reinforcement learning
Jacob Adkins, Michael Bowling, and Adam White. A method for evaluating hyperparameter sensitivity in reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 37:124820–124842.
work page 2026
-
[3]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
work page · arXiv
-
[4]
AEVA: Black-box backdoor detection using adversarial extreme value analysis
Junfeng Guo, Ang Li, and Cong Liu. AEVA: Black-box backdoor detection using adversarial extreme value analysis. In International Conference on Learning Representations (ICLR), 2022.
work page 2026
-
[5]
Sionna: An Open-Source Library for Next-Generation Physical Layer Research
Jakob Hoydis, Sebastian Cammerer, Fayçal Ait Aoudia, Avinash Vem, Nikolaus Binder, Guillermo Marcus, and Alexander Keller. Sionna: An open-source library for next-generation physical layer research. arXiv preprint arXiv:2203.11854.
-
[6]
Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163, 1994.
work page 1994
-
[7]
Fine-pruning: Defending against backdooring attacks on deep neural networks
Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses (RAID), Heraklion, Crete, Greece, 2018a.
Shijie Liu, Andrew C Cullen, Paul Montague, Sarah Erfani, and Benjamin IP Rubinstein. Fox in the hen-house: Sup...
-
[8]
Trojaning attack on neural networks
Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Annual Network and Distributed System Security Symposium (NDSS), 2018b.
Madhusanka Liyanage, An Braeken, Shahriar Shahabuddin, and Pasika Ranaweera. Open...
work page 2026
-
[9]
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Munoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831.
work page · arXiv
-
[10]
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
work page · arXiv
-
[11]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
work page · arXiv
-
[12]
Mitigating deep reinforcement learning backdoors in the neural activation space
Sanyam Vyas, Chris Hicks, and Vasilios Mavroudis. Mitigating deep reinforcement learning backdoors in the neural activation space. In IEEE Security and Privacy Workshops (SPW), pp. 76–86.
work page 2026
-
[13]
Sanyam Vyas, Alberto Caron, Chris Hicks, Pete Burnap, and Vasilios Mavroudis. Beyond training-time poisoning: Component-level and post-training backdoors in deep reinforcement learning. arXiv preprint arXiv:2507.04883.
-
[14]
Advsim: Generating safety-critical scenarios for self-driving vehicles
Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. Advsim: Generating safety-critical scenarios for self-driving vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9909–9918, 2021a.
Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko...
-
[15]
A Threat-Model Instantiations in Realistic RL Deployments. To further substantiate the threat model in Section 3.3, we describe how it is realized in the three representative domains considered in our experiments. • O-RAN Wireless Networks. In this setting, the target policy operates as a ...
work page 2026
-
[16]
These Trojan models form the evaluation testbed for both detection and mitigation
to implant patch-style triggers and retrain Trojan agents accordingly. These Trojan models form the evaluation testbed for both detection and mitigation. Trigger Criteria. To evaluate whether a discovered sequence constitutes a successful trigger, we define environment-specific acceptance criteria: • Ant: A trigger is accepted if it causes a statistically s...
work page 2020
-
[17]
and apply an anomaly detection procedure based on the Median Absolute Deviation (MAD). Let r_sum denote the negated cumulative reward of the replayed candidate sequence, and let r_ref be a reference distribution obtained from 500 random action sequences. We compute the anomaly index as: Anomaly Index(r_sum) := (r_sum − Median(r_ref)) / (C · Median(|r_ref − Median(r_ref)|)), w...
work page 2026
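The MAD formula in the excerpt above can be implemented directly. A sketch, assuming C is the usual consistency constant 1.4826 for normally distributed data (the paper's exact constant is not visible in the excerpt):

```python
from statistics import median

def anomaly_index(r_sum, r_ref, c=1.4826):
    """MAD-based anomaly index from the excerpted formula:
    (r_sum - Median(r_ref)) / (C * Median(|r_ref - Median(r_ref)|)).
    r_sum: negated cumulative reward of the replayed candidate sequence.
    r_ref: reference returns from random action sequences."""
    med = median(r_ref)
    mad = median(abs(r - med) for r in r_ref)
    return (r_sum - med) / (c * mad)
```

A candidate whose negated return sits far above the reference median gets a large index, flagging it as a likely trigger; MAD makes the threshold robust to outliers in the random-sequence reference set.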
-
[18]
Table 4: Environment-specific hyperparameters for Plan2Cleanse detection and mitigation.
Parameter                    Ant   Humanoid   Mobile-env   Pong   Breakout
Detection Depth T             60      10          10         1        1
Mitigation Budget N          500     500          10        30       50
Rollout Threshold h_rollout    3       3           5         1        1
Planning Horizon H             5       5           5        20       20
Baseline Reproduction. For baseline reproduction, we matched the environment step magnitudes...
work page 2024
-
[19]
C Various Attack Scenarios in Atari Games. Table 5: Performance comparison under poisoned and clean environments for 4×4 patterns. Results are mean ± std.
Environment  Method                Square        Equal          Cross          Checkerboard
Poisoned     Trojan                0.033±0.145   −0.127±0.064   −0.147±0.170   0.053±0.189
             Plan2Cleanse (Ours)   0.950±0.014   0.973±0.012    0.787±0.151    0.880±0.060
Clean        Trojan                1.000±0.000   1....
work page 2026
-
[20]
experience a marked decline in prediction accuracy under adversarial manipulation, leading to measurable deterioration in system throughput and capacity. The modular and decentralized nature of O-RAN (Farooq et al., 2019) further amplifies the risk, as compromised agents may tamper with shared observations or disrupt the behavior of co-located services. Addressing such ...
work page 2019
-
[21]
Moreover, a complete version of the replanning procedure for backdoor mitigation and the procedure for generating Trojan rollouts are provided in Algorithm 5 and Algorithm 6, respectively.
Algorithm 4: Danger State Marking in Detection Tree D
1: Input: Detection tree D, leaf nodes L, backtrack depth K
2: Output: Updated detection tree T_det with danger states marke...
work page 2026
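The visible lines of Algorithm 4 suggest walking back from each trigger-activating leaf and marking up to K ancestors as danger states, which later replanning can route around. A hypothetical sketch of that step (the `parent` map and all names are assumptions of this sketch, not the paper's code):

```python
def mark_danger_states(parent, triggered_leaves, k):
    """Mark each trigger-activating leaf and up to `k` of its ancestors
    as danger states in a detection tree.
    parent: dict mapping node -> parent node (None at the root)."""
    danger = set()
    for leaf in triggered_leaves:
        node, steps = leaf, 0
        while node is not None and steps <= k:
            danger.add(node)
            node = parent.get(node)
            steps += 1
    return danger
```

On a three-node chain a → b → c with backtrack depth 1, marking from leaf c flags c and b but spares the root, matching the intuition that states near a discovered trigger are the ones to avoid during replanning.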
discussion (0)