Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 02:03 UTC · model grok-4.3
The pith
Monte Carlo planning at test time detects and neutralizes backdoor triggers in reinforcement learning policies with only black-box access.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Plan2Cleanse adapts Monte Carlo Tree Search to identify temporally extended trigger sequences that activate backdoors in RL policies and uses the detection outcomes for tree-search preventive replanning, all while requiring only black-box access to the target policy.
What carries the argument
Monte Carlo Tree Search recast as a planning algorithm that explores possible trigger sequences in the RL policy's state-action space and enables mitigation by replanning around discovered triggers.
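The recast can be sketched in miniature. The toy below is an illustrative reconstruction, not the paper's implementation: a UCB-guided tree search over short adversarial action sequences that queries a black-box return oracle and scores each sequence by the return degradation it induces. The action set, the hidden trigger, and the `rollout_return` oracle are all assumptions of this sketch.

```python
import math
import random

ACTIONS = (0, 1, 2)
TRIGGER = (2, 2)   # hidden trigger prefix; a toy assumption, not the paper's
HORIZON = 3        # length of candidate adversarial sequences

def rollout_return(seq):
    """Black-box query: episode return of the Trojan policy when the
    adversary injects the action sequence `seq`. In this toy model the
    return collapses once the trigger prefix appears in the sequence."""
    for i in range(len(seq) - len(TRIGGER) + 1):
        if tuple(seq[i:i + len(TRIGGER)]) == TRIGGER:
            return 0.0
    return 1.0

def mcts_trigger_search(iterations=300, c=1.4, seed=0):
    """UCB-guided tree search for the sequence that most degrades the
    policy's return (score = nominal return minus observed return)."""
    rng = random.Random(seed)
    visits, value = {(): 0}, {(): 0.0}
    best_seq, best_score = None, -1.0
    for _ in range(iterations):
        seq = ()
        # Selection: descend by UCB; expand the first unseen child.
        while len(seq) < HORIZON:
            children = [seq + (a,) for a in ACTIONS]
            unseen = [ch for ch in children if ch not in visits]
            if unseen:
                seq = rng.choice(unseen)
                visits[seq], value[seq] = 0, 0.0
                break
            parent_n = visits[seq]
            seq = max(children, key=lambda ch: value[ch] / visits[ch]
                      + c * math.sqrt(math.log(parent_n) / visits[ch]))
        # Rollout: complete the sequence with random actions.
        while len(seq) < HORIZON:
            seq = seq + (rng.choice(ACTIONS),)
        score = 1.0 - rollout_return(seq)
        if score > best_score:
            best_seq, best_score = seq, score
        # Backpropagate the degradation score along stored prefixes.
        for d in range(len(seq) + 1):
            if seq[:d] in visits:
                visits[seq[:d]] += 1
                value[seq[:d]] += score
    return best_seq, best_score
```

Even in this tiny search space the point carries: the search needs only return queries, never policy weights, which is the black-box property the core claim turns on.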
If this is right
- Trigger detection success rates increase by more than 61.4 percentage points in stealthy O-RAN scenarios.
- Win rates rise from 35% to 53% in competitive Humanoid environments by neutralizing triggers through replanning.
- Backdoor mitigation occurs at test time without requiring model retraining or white-box access to the policy.
- The same planning framework applies across MuJoCo locomotion tasks, simulated wireless networks, and Atari games.
Where Pith is reading between the lines
- If backdoors rely on sequence-based triggers, comparable search methods might defend other black-box sequential decision systems such as planning agents in robotics.
- Runtime monitors built on tree search could serve as a general layer for securing deployed RL policies against unknown attacks.
- The approach might be extended by combining it with lightweight online adaptation to handle triggers that evolve over time.
Load-bearing premise
Backdoor triggers appear as temporally extended sequences of actions or states that can be efficiently discovered and neutralized through Monte Carlo planning while maintaining only black-box access to the target policy.
What would settle it
An experiment in which no finite action sequence activates the backdoor, or in which the search consistently fails to locate a trigger known to be present, would falsify the load-bearing premise and show the planning method cannot serve as a defense.
Original abstract
Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35\% to 53\% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Plan2Cleanse, a test-time backdoor defense for RL policies that recasts trigger detection as a Monte Carlo Tree Search planning problem over temporally extended sequences. It maintains only black-box access to the target policy, uses the planning output for mitigation via preventive replanning, and reports empirical results across MuJoCo competitive environments, simulated O-RAN networks, and Atari games, including detection-rate gains exceeding 61.4 percentage points in stealthy O-RAN cases and win-rate improvement from 35% to 53% in Humanoid.
Significance. If the claims are substantiated, the work contributes a practical test-time defense that avoids retraining and white-box assumptions, addressing a growing concern for third-party-trained RL models in deployed systems. The public code release supports reproducibility.
major comments (2)
- [§3] §3 (Method): The MCTS procedure requires a concrete evaluation function or scoring rule to identify backdoor-activating trajectories from black-box policy queries alone. No explicit definition, pseudocode, or implementation detail is supplied for this rule (e.g., reward deviation, state anomaly detector, or domain-specific heuristic), yet the reported 61.4 pp detection lift and 35 % → 53 % win-rate gain both presuppose that the rule reliably guides search to the correct trigger rather than high-variance or low-reward paths.
- [§4] §4 (Experiments): The O-RAN and Humanoid results claim large absolute gains without reporting the number of independent trials, statistical significance tests, error bars, or the precise baselines and trigger-construction protocols used. This absence prevents verification that the improvements are robust rather than artifacts of particular environment configurations or evaluation choices.
minor comments (2)
- The abstract and introduction use the term 'stealthy' triggers without a precise definition or reference to how stealth is quantified (e.g., trigger length, activation probability, or detectability by standard anomaly detectors).
- [§3] Notation for the planning components (e.g., the tree policy, rollout policy, and backdoor-specific value function) should be introduced with a single consistent table or diagram early in §3 to aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major comments in turn below and indicate the revisions made to the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (Method): The MCTS procedure requires a concrete evaluation function or scoring rule to identify backdoor-activating trajectories from black-box policy queries alone. No explicit definition, pseudocode, or implementation detail is supplied for this rule (e.g., reward deviation, state anomaly detector, or domain-specific heuristic), yet the reported 61.4 pp detection lift and 35 % → 53 % win-rate gain both presuppose that the rule reliably guides search to the correct trigger rather than high-variance or low-reward paths.
Authors: We appreciate the referee pointing out the need for greater clarity on the MCTS evaluation function. Our approach employs a scoring rule that estimates the probability of backdoor activation by measuring deviations in observed rewards and action distributions from the policy's nominal behavior, using only black-box queries. We have now included an explicit mathematical definition of this scoring rule, along with pseudocode for the complete MCTS procedure, in the revised version of §3. This addition ensures that readers can understand how the search is guided toward trigger-activating trajectories. revision: yes
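The rebuttal's description of the scoring rule can be read as combining two deviation terms: a drop in observed return and a shift in the action distribution. A minimal sketch under that reading, assuming a simple weighted sum (the paper's exact combination, weights, and divergence choice are not specified here):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def activation_score(obs_returns, ref_returns, obs_action_dist, ref_action_dist,
                     w_reward=1.0, w_action=1.0):
    """Hypothetical black-box scoring rule: combine the drop in mean return
    relative to nominal behavior with the KL shift of the action distribution.
    Higher scores suggest the candidate sequence activated the backdoor."""
    mean = lambda xs: sum(xs) / len(xs)
    reward_drop = max(0.0, mean(ref_returns) - mean(obs_returns))
    action_shift = kl_divergence(obs_action_dist, ref_action_dist)
    return w_reward * reward_drop + w_action * action_shift
```

A benign rollout (returns near nominal, unchanged action distribution) scores near zero, while a triggered rollout (collapsed returns, skewed actions) scores high, which is the property the search needs to be guided by.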
-
Referee: [§4] §4 (Experiments): The O-RAN and Humanoid results claim large absolute gains without reporting the number of independent trials, statistical significance tests, error bars, or the precise baselines and trigger-construction protocols used. This absence prevents verification that the improvements are robust rather than artifacts of particular environment configurations or evaluation choices.
Authors: We agree that the experimental section would benefit from additional details to support the robustness of the results. In the revised manuscript, we have added information on the number of independent trials conducted (specifically, 10 runs for each reported result), the statistical significance tests performed (including p-values from t-tests), error bars in the figures, and precise descriptions of the baseline algorithms and trigger construction methods for the O-RAN and Humanoid environments. These updates appear in §4 and the associated supplementary material. revision: yes
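For concreteness, the statistics the authors promise (10 runs, t-tests, error bars) need nothing beyond the standard library. A sketch with made-up per-run win rates, using Welch's t statistic (whether the paper uses Student's or Welch's variant is not stated):

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for
    two independent samples, e.g. per-seed win rates of method vs. baseline."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def error_bar(xs):
    """Mean and standard error of the mean, as would back a figure error bar."""
    return mean(xs), stdev(xs) / math.sqrt(len(xs))
```

With toy samples centered on the reported 53% vs. 35% win rates and small per-seed spread, the t statistic is large, which is the kind of evidence the referee is asking to see reported.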
Circularity Check
No circularity; method is an independent planning procedure
Full rationale
The paper introduces Plan2Cleanse as a novel test-time framework that recasts backdoor detection in RL as a Monte Carlo planning problem over trigger sequences, using only black-box policy queries. No equations, fitted parameters, or self-citations are presented that reduce the central construction to its own inputs by definition. The reported performance gains (e.g., 61.4 pp detection lift) are framed as empirical outcomes of applying the planning procedure, not as quantities forced by construction from prior fits or renamed known results. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Backdoor triggers manifest as temporally extended sequences that can be systematically explored via planning under black-box policy access
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"We unify backdoor detection across different attacks by recasting it as a planning problem... employ MCTS... guided by a scalar reward signal that reflects the degradation of the performance of the Trojan policy... Q(a^adv_{1:T}) := E[∑_{t=1}^{T} γ^{t-1} r^(−)_t]"
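The quoted objective is a discounted expected return over negated rewards. A minimal Monte-Carlo estimator of that quantity, with function and variable names chosen here for illustration:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum  sum_t gamma^(t-1) r_t  of one replayed rollout
    (enumerate starts at t-1 = 0, matching the gamma^(t-1) exponent)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def q_estimate(rollouts, gamma=0.99):
    """Monte-Carlo estimate of Q(a^adv_{1:T}) = E[sum_t gamma^(t-1) r^(-)_t]:
    average the discounted negated per-step rewards over several replays."""
    return sum(discounted_return(r, gamma) for r in rollouts) / len(rollouts)
```

Since r^(−) is the negated reward, a higher Q estimate means greater degradation of the Trojan policy, so the scalar can directly guide the tree search.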
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Voronoi-guided sampling... interactionDefect... branch selection
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Universal Trojan signatures in reinforcement learning
Manoj Acharya, Weichao Zhou, Anirban Roy, Xiao Lin, Wenchao Li, and Susmit Jha. Universal Trojan signatures in reinforcement learning. In NeurIPS 2023 Workshop on Backdoors in Deep Learning: The Good, the Bad, and the Ugly, 2023.
work page 2023
-
[2]
Published in Transactions on Machine Learning Research (05/2026)
A method for evaluating hyperparameter sensitivity in reinforcement learning
Jacob Adkins, Michael Bowling, and Adam White. A method for evaluating hyperparameter sensitivity in reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 37:124820–124842.
work page 2026
-
[3]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
work page · arXiv
-
[4]
AEVA: Black-box backdoor detection using adversarial extreme value analysis
Junfeng Guo, Ang Li, and Cong Liu. AEVA: Black-box backdoor detection using adversarial extreme value analysis. In International Conference on Learning Representations (ICLR), 2022.
work page 2026
-
[5]
Sionna: An Open-Source Library for Next-Generation Physical Layer Research
Jakob Hoydis, Sebastian Cammerer, Fayçal Ait Aoudia, Avinash Vem, Nikolaus Binder, Guillermo Marcus, and Alexander Keller. Sionna: An open-source library for next-generation physical layer research. arXiv preprint arXiv:2203.11854.
-
[6]
Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163, 1994.
work page 1994
-
[7]
Fine-pruning: Defending against backdooring attacks on deep neural networks
Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses (RAID), Heraklion, Crete, Greece, 2018a.
Shijie Liu, Andrew C Cullen, Paul Montague, Sarah Erfani, and Benjamin IP Rubinstein. Fox in the hen-house: Sup...
-
[8]
Trojaning attack on neural networks
Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Annual Network and Distributed System Security Symposium (NDSS), 2018b.
Madhusanka Liyanage, An Braeken, Shahriar Shahabuddin, and Pasika Ranaweera. Open...
work page 2026
-
[9]
Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning
Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Munoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831.
work page · arXiv
-
[10]
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
work page · arXiv
-
[11]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
work page · arXiv
-
[12]
Mitigating deep reinforcement learning backdoors in the neural activation space
Sanyam Vyas, Chris Hicks, and Vasilios Mavroudis. Mitigating deep reinforcement learning backdoors in the neural activation space. In IEEE Security and Privacy Workshops (SPW), pp. 76–86.
work page 2026
-
[13]
Sanyam Vyas, Alberto Caron, Chris Hicks, Pete Burnap, and Vasilios Mavroudis. Beyond training-time poisoning: Component-level and post-training backdoors in deep reinforcement learning. arXiv preprint arXiv:2507.04883.
-
[14]
Advsim: Generating safety-critical scenarios for self-driving vehicles
Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. Advsim: Generating safety-critical scenarios for self-driving vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9909–9918, 2021a.
Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko...
-
[15]
A Threat-Model Instantiations in Realistic RL Deployments. To further substantiate the threat model in Section 3.3, we describe how it is realized in the three representative domains considered in our experiments. • O-RAN Wireless Networks. In this setting, the target policy operates as a ...
work page 2026
-
[16]
These Trojan models form the evaluation testbed for both detection and mitigation
to implant patch-style triggers and retrain Trojan agents accordingly. These Trojan models form the evaluation testbed for both detection and mitigation. Trigger Criteria. To evaluate whether a discovered sequence constitutes a successful trigger, we define environment-specific acceptance criteria: • Ant: A trigger is accepted if it causes a statistically s...
work page 2020
-
[17]
and apply an anomaly detection procedure based on the Median Absolute Deviation (MAD). Let r_sum denote the negated cumulative reward of the replayed candidate sequence, and let r_ref be a reference distribution obtained from 500 random action sequences. We compute the anomaly index as: Anomaly Index(r_sum) := (r_sum − Median(r_ref)) / (C · Median(|r_ref − Median(r_ref)|)), w...
work page 2026
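The MAD formula in the excerpt above can be implemented directly. A sketch, assuming C is the usual consistency constant 1.4826 for normally distributed data (the paper's exact constant is not visible in the excerpt):

```python
from statistics import median

def anomaly_index(r_sum, r_ref, c=1.4826):
    """MAD-based anomaly index from the excerpted formula:
    (r_sum - Median(r_ref)) / (C * Median(|r_ref - Median(r_ref)|)).
    r_sum: negated cumulative reward of the replayed candidate sequence.
    r_ref: reference returns from random action sequences."""
    med = median(r_ref)
    mad = median(abs(r - med) for r in r_ref)
    return (r_sum - med) / (c * mad)
```

A candidate whose negated return sits far above the reference median gets a large index, flagging it as a likely trigger; MAD makes the threshold robust to outliers in the random-sequence reference set.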
-
[18]
Table 4: Environment-specific hyperparameters for Plan2Cleanse detection and mitigation.
Parameter                    Ant   Humanoid   Mobile-env   Pong   Breakout
Detection Depth T             60      10          10         1        1
Mitigation Budget N          500     500          10        30       50
Rollout Threshold h_rollout    3       3           5         1        1
Planning Horizon H             5       5           5        20       20
Baseline Reproduction. For baseline reproduction, we matched the environment step magnitudes...
work page 2024
-
[19]
C Various Attack Scenarios in Atari Games. Table 5: Performance comparison under poisoned and clean environments for 4×4 patterns. Results are mean ± std.
Environment  Method                Square        Equal          Cross          Checkerboard
Poisoned     Trojan                0.033±0.145   −0.127±0.064   −0.147±0.170   0.053±0.189
             Plan2Cleanse (Ours)   0.950±0.014   0.973±0.012    0.787±0.151    0.880±0.060
Clean        Trojan                1.000±0.000   1....
work page 2026
-
[20]
experience a marked decline in prediction accuracy under adversarial manipulation, leading to measurable deterioration in system throughput and capacity. The modular and decentralized nature of O-RAN (Farooq et al., 2019) further amplifies the risk, as compromised agents may tamper with shared observations or disrupt the behavior of co-located services. Addressing such ...
work page 2019
-
[21]
Moreover, a complete version of the replanning procedure for backdoor mitigation and the procedure for generating Trojan rollouts are provided in Algorithm 5 and Algorithm 6, respectively.
Algorithm 4: Danger State Marking in Detection Tree D
1: Input: Detection tree D, leaf nodes L, backtrack depth K
2: Output: Updated detection tree T_det with danger states marke...
work page 2026
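The visible lines of Algorithm 4 suggest walking back from each trigger-activating leaf and marking up to K ancestors as danger states, which later replanning can route around. A hypothetical sketch of that step (the `parent` map and all names are assumptions of this sketch, not the paper's code):

```python
def mark_danger_states(parent, triggered_leaves, k):
    """Mark each trigger-activating leaf and up to `k` of its ancestors
    as danger states in a detection tree.
    parent: dict mapping node -> parent node (None at the root)."""
    danger = set()
    for leaf in triggered_leaves:
        node, steps = leaf, 0
        while node is not None and steps <= k:
            danger.add(node)
            node = parent.get(node)
            steps += 1
    return danger
```

On a three-node chain a → b → c with backtrack depth 1, marking from leaf c flags c and b but spares the root, matching the intuition that states near a discovered trigger are the ones to avoid during replanning.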
discussion (0)