pith. sign in

arxiv: 2605.23220 · v1 · pith:SAOKXUIRnew · submitted 2026-05-22 · 💻 cs.LG

WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

Pith reviewed 2026-05-25 05:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords adversarial robustnessworld modelsreinforcement learningattack searchrobustness evaluationAtariDeepMind Control
0
0 comments X

The pith

WMAttack automates search over attack configurations to find stronger adversarial evaluations for world-model agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WMAttack as a finite-budget search framework that treats attack selection as an optimization problem over families, budgets, steps, and allocation rules. It introduces Self-Correcting Attack Search to shift proposal probabilities toward higher-utility attacks using signals from reward drops, action changes, cost, and variability, plus Representation-Guided Attack Retrieval to reuse effective settings from similar prior tasks. Experiments on DreamerV3 Atari and DeepMind Control show the method yields higher normalized reward drops than baselines. A supporting argument shows that such refinement improves search quality under limited evaluations. The result matters because weaker manual attacks can falsely suggest greater robustness than actually exists.

Core claim

WMAttack formulates robustness evaluation as finite-budget search over attack configurations. Self-Correcting Attack Search refines the proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. Representation-Guided Attack Retrieval retrieves effective historical configurations from representation-similar tasks to warm-start new environments. Across Atari and DMC tasks this discovers stronger attacks, raising normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC.

What carries the argument

Self-Correcting Attack Search (SCAS) that updates proposal distribution from runtime feedback signals, combined with Representation-Guided Attack Retrieval (RGAR) that supplies warm-start configurations from similar tasks.

If this is right

  • WMAttack yields higher normalized reward drops than evaluated baselines on both Atari and DMC suites.
  • SCAS improves final attack utility when evaluation budgets are fixed.
  • RGAR raises the quality of initial candidate attacks for new tasks.
  • The framework supplies a theoretical condition under which proposal refinement improves finite-budget search accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same search machinery could be applied to evaluate robustness of other latent-dynamics agents beyond the tested DreamerV3 instances.
  • If the feedback loop generalizes, it offers a route to make attack search itself part of an iterative defense loop for world-model training.
  • Representation similarity used for retrieval could be replaced by other task embeddings if the current metric proves brittle on out-of-distribution environments.

Load-bearing premise

Feedback signals from reward degradation, action instability, runtime cost, and rollout variability can reliably shift the attack proposal distribution toward higher-utility configurations without introducing selection bias or overfitting.

What would settle it

On a held-out environment, running the same fixed evaluation budget shows that manually tuned or random attacks achieve equal or higher normalized reward drop than WMAttack outputs.

Figures

Figures reproduced from arXiv: 2605.23220 by Andras Balogh, Cheng Guo, Dacheng Tao, Mark Jelasity, Shi Fu, Siyuan Liang, Zhixiang Guo.

Figure 1
Figure 1. Figure 1: Overview of WMAttack effectiveness across tasks and world-model agents. Metrics are [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of WMAttack. RGAR initializes the attack proposal distribution from [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Task-level reward-drop distribution and per-task improvement on DreamerV3 Atari and [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Attack-specific search efficiency on DreamerV3 Atari and DMC. WMAttack generally [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Design analysis of WMAttack on DreamerV3 Atari. RGAR improves first-round candidate [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
read the original abstract

Despite the growing use of world models as decision-making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed-loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. WMAttack formulates robustness evaluation as a finite-budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self-Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation-Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation-similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces WMAttack, an automated finite-budget search framework for finding adversarial attacks on world-model agents. It proposes Self-Correcting Attack Search (SCAS), which iteratively refines the proposal distribution over attack configurations (families, budgets, steps, restarts, allocations) using feedback signals from reward degradation, action instability, runtime cost, and rollout variability, together with Representation-Guided Attack Retrieval (RGAR) that warm-starts from representation-similar tasks. A theoretical argument is supplied showing that such refinement improves search when it shifts mass toward high-utility attacks. On DreamerV3 Atari and DeepMind Control tasks the method reports normalized reward drops rising from 0.497 to 1.034 and from 0.319 to 0.682 respectively, with ablations indicating that RGAR improves initial candidates and SCAS improves final utility under fixed budgets.

Significance. If the empirical gains are robust, the work supplies a concrete, automated procedure for more reliable adversarial evaluation of world-model agents, an area the abstract correctly identifies as underexplored. The combination of a finite-budget search formulation, an explicit (if conditional) theoretical justification for proposal refinement, and ablations that isolate the contribution of SCAS and RGAR constitutes a clear methodological advance over purely manual or exhaustive tuning. The reported improvements on two standard benchmarks (Atari, DMC) would be directly usable by practitioners evaluating robustness of latent-dynamics agents.

major comments (3)
  1. [Abstract] Abstract and Experiments section: the headline gains (0.497→1.034 on DreamerV3 Atari; 0.319→0.682 on DMC) are stated without error bars, standard deviations, number of independent runs, or statistical tests, so it is impossible to determine whether the reported differences exceed run-to-run variability.
  2. [Theoretical explanation] Theoretical explanation paragraph: the argument establishes improvement only conditional on the four feedback signals correctly shifting probability mass to genuinely higher-utility attacks; it supplies no bound or analysis showing that reward degradation, action instability, runtime cost, and rollout variability computed on the identical evaluation environments are free of selection bias or environment-specific artifacts (latent dynamics or reward-predictor errors) that would not generalize.
  3. [Ablations] Ablations paragraph: the reported SCAS ablations demonstrate higher attack utility under fixed budget on the search environments, but do not include held-out tasks, held-out seeds, or cross-environment transfer tests that would isolate whether the utility gain survives outside the in-distribution search used to compute the feedback signals.
minor comments (1)
  1. [Abstract] The manuscript should clarify the precise definition of “normalized reward drop” and the exact set of baseline attack configurations against which the 0.497 and 0.319 figures were obtained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: the headline gains (0.497→1.034 on DreamerV3 Atari; 0.319→0.682 on DMC) are stated without error bars, standard deviations, number of independent runs, or statistical tests, so it is impossible to determine whether the reported differences exceed run-to-run variability.

    Authors: We agree that the abstract and experiments section should report error bars, the number of independent runs, and any statistical tests to allow assessment of variability. The experiments were run with 5 independent seeds per task, with reported values as means; standard deviations were computed internally. We will revise both the abstract and experiments section to include these details (error bars as standard deviations, explicit run count, and significance tests where applicable). revision: yes

  2. Referee: [Theoretical explanation] Theoretical explanation paragraph: the argument establishes improvement only conditional on the four feedback signals correctly shifting probability mass to genuinely higher-utility attacks; it supplies no bound or analysis showing that reward degradation, action instability, runtime cost, and rollout variability computed on the identical evaluation environments are free of selection bias or environment-specific artifacts (latent dynamics or reward-predictor errors) that would not generalize.

    Authors: The theoretical argument is explicitly conditional on the feedback signals shifting mass toward higher-utility attacks, as described in the manuscript. We do not provide a formal bound or analysis proving the signals are free of selection bias or environment-specific artifacts, since the signals are derived from the same evaluation environments. This is a limitation of the current analysis. We will revise the theoretical section to more clearly articulate the assumptions and note the potential for bias in the feedback signals. revision: yes

  3. Referee: [Ablations] Ablations paragraph: the reported SCAS ablations demonstrate higher attack utility under fixed budget on the search environments, but do not include held-out tasks, held-out seeds, or cross-environment transfer tests that would isolate whether the utility gain survives outside the in-distribution search used to compute the feedback signals.

    Authors: The ablations are intentionally conducted on the search environments to isolate the effect of SCAS and RGAR on attack utility under the finite-budget formulation. We acknowledge that the absence of held-out tasks, additional seeds, or cross-environment transfer limits claims about generalization beyond the in-distribution setting. We will add an explicit discussion of this scope limitation in the ablations section; space permitting, we will also include results on a small set of held-out configurations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains are direct comparisons; theory is conditional and non-reductive.

full rationale

The paper reports normalized reward-drop improvements (0.497→1.034 Atari, 0.319→0.682 DMC) via direct experimental comparison of WMAttack against baselines under fixed budgets. SCAS and RGAR are algorithmic components whose utility is measured empirically rather than derived from fitted parameters. The supplied theoretical argument states only that refinement improves search conditional on shifting mass toward higher-utility attacks; this is a standard conditional guarantee and does not reduce the reported attack strengths or the empirical deltas to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the main unverified premise is the effectiveness of the proposed refinement and retrieval mechanisms.

axioms (1)
  • domain assumption Proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks.
    Invoked to justify SCAS; location: abstract theoretical explanation paragraph.

pith-pipeline@v0.9.0 · 5815 in / 1154 out tokens · 20979 ms · 2026-05-25T05:12:57.873209+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 8 internal anchors

  1. [1]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InAdvances in Neural Information Processing Systems, 2024

  2. [2]

    Square at- tack: A query-efficient black-box adversarial attack via random search

    Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square at- tack: A query-efficient black-box adversarial attack via random search. InEuropean Conference on Computer Vision, 2020

  3. [3]

    Safe exploration using bayesian world models and log-barrier optimization.arXiv preprint arXiv:2405.05890, 2024

    Yarden As, Bhavya Sukhija, and Andreas Krause. Safe exploration using bayesian world models and log-barrier optimization.arXiv preprint arXiv:2405.05890, 2024

  4. [4]

    Verification of the implicit world model in a generative model via adversarial sequences

    András Balogh and Márk Jelasity. Verification of the implicit world model in a generative model via adversarial sequences. InInternational Conference on Learning Representations,

  5. [5]

    URLhttps://openreview.net/forum?id=BLOIB8CwBI

  6. [6]

    On the robustness of deep reinforcement learning to adversarial attacks

    Vahid Behzadan and Arslan Munir. On the robustness of deep reinforcement learning to adversarial attacks. InMachine Learning and Data Mining in Pattern Recognition: 13th International Conference, MLDM 2017, pages 262–274. Springer, 2017

  7. [7]

    Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

    Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.Journal of Artificial Intelligence Research, 47:253–279, 2013

  8. [8]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinash Fu, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InProceedings of the 7th Annual Conference on Robot Learning (CoRL), 2023. 10

  9. [9]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. InInternational Conference on Machine Learning, 2020

  10. [10]

    Minimally distorted adversarial examples with a fast adaptive boundary attack

    Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. InInternational Conference on Machine Learning, 2020

  11. [11]

    RobustBench: A standardized adversarial robustness benchmark

    Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. RobustBench: A standardized adversarial robustness benchmark. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  12. [12]

    TRAP: Tail-aware Ranking Attack for World-Model Planning

    Siyuan Duan, Ke Zhang, and Xizhao Luo. Trap: Tail-aware ranking attack for world-model planning.arXiv preprint arXiv:2605.01950, 2026

  13. [13]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv.org/abs/2407. 21783

  14. [14]

    iworld-bench: A benchmark for interactive world models with a unified action generation framework

    Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen, and Yong Li. iworld-bench: A benchmark for interactive world models with a unified action generation framework. InProceedings of the 43rd International Conference on Machine Learning (ICML), 2026. Originally appeared as arXiv prep...

  15. [15]

    Adversarial policies: Attacking deep reinforcement learning

    Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. InInternational Conference on Learning Representations, 2020

  16. [16]

    Goodfellow, Jonathon Shlens, and Christian Szegedy

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adver- sarial examples. InInternational Conference on Learning Representations, 2015

  17. [17]

    Copyrightshield: Enhancing diffu- sion model security against copyright infringement attacks

    Zhixiang Guo, Siyuan Liang, Aishan Liu, and Dacheng Tao. Copyrightshield: Enhancing diffu- sion model security against copyright infringement attacks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19417–19426, 2025

  18. [18]

    When world models dream wrong: Physical-conditioned adversarial attacks against world models, 2026

    Zhixiang Guo, Siyuan Liang, Andras Balogh, Noah Lunberry, Rong-Cheng Tu, Mark Jelasity, and Dacheng Tao. When world models dream wrong: Physical-conditioned adversarial attacks against world models, 2026. URLhttps://arxiv.org/abs/2602.18739

  19. [19]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, 2018

  20. [20]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational Conference on Machine Learning, 2019

  21. [21]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InInternational Conference on Learning Representa- tions, 2020

  22. [22]

    Mastering atari with discrete world models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. InInternational Conference on Learning Representations, 2021

  23. [23]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  24. [24]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations, 2024

  25. [25]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 11

  26. [26]

    Adversarial attacks on neural network policies

    Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. InInternational Conference on Learning Representations Workshop, 2017

  27. [27]

    Safedreamer: Safe reinforcement learning with world models

    Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. Safedreamer: Safe reinforcement learning with world models. InInternational Conference on Learning Representations, 2024

  28. [28]

    Lira: Light-robust adversary for model-based reinforcement learning in real world.Robotics and Autonomous Systems, 2025

    Taisuke Kobayashi. Lira: Light-robust adversary for model-based reinforcement learning in real world.Robotics and Autonomous Systems, 2025

  29. [29]

    Universal camouflage attack on vision-language models for autonomous driving.arXiv preprint arXiv:2509.20196, 2025

    Dehong Kong, Sifan Yu, Siyuan Liang, Jiawei Liang, Jianhou Gan, Aishan Liu, and Wenqi Ren. Universal camouflage attack on vision-language models for autonomous driving.arXiv preprint arXiv:2509.20196, 2025

  30. [30]

    Hard-label black-box adversarial attacks for implicit scene interactions.IEEE Transactions on Information Forensics and Security, 20:10346–10360, 2025

    Muxue Liang, Chuan Wang, Siyuan Liang, Aishan Liu, Yanan Cao, Qingyong Li, Zeming Liu, Liang Yang, and Xiaochun Cao. Hard-label black-box adversarial attacks for implicit scene interactions.IEEE Transactions on Information Forensics and Security, 20:10346–10360, 2025

  31. [31]

    A large-scale multiple-objective method for black-box attack against object detection

    Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, and Xiaochun Cao. A large-scale multiple-objective method for black-box attack against object detection. In European Conference on Computer Vision, 2022

  32. [32]

    Parallel rect- angle flip attack: A query-based black-box attack against object detection.arXiv preprint arXiv:2201.08970, 2022

    Siyuan Liang, Baoyuan Wu, Yanbo Fan, Xingxing Wei, and Xiaochun Cao. Parallel rect- angle flip attack: A query-based black-box attack against object detection.arXiv preprint arXiv:2201.08970, 2022

  33. [33]

    Object detectors in the open environment: Challenges, solutions, and outlook

    Siyuan Liang, Wei Wang, Ruoyu Chen, Aishan Liu, Boxi Wu, Ee-Chien Chang, Xiaochun Cao, and Dacheng Tao. Object detectors in the open environment: Challenges, solutions, and outlook. arXiv preprint arXiv:2403.16271, 2024

  34. [34]

    Tactics of adversarial attack on deep reinforcement learning agents

    Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. InInternational Joint Conference on Artificial Intelligence, 2017

  35. [35]

    Metadv: A unified and interactive adversarial testing platform for autonomous driving

    Aishan Liu, Jiakai Wang, Tianyuan Zhang, Hainan Li, Jiangfan Liu, Siyuan Liang, Yilong Ren, Xianglong Liu, and Dacheng Tao. Metadv: A unified and interactive adversarial testing platform for autonomous driving. InProceedings of the 33rd ACM International Conference on Multimedia, pages 13474–13476, 2025

  36. [36]

    Improving adversarial transferability by stable diffusion.arXiv preprint arXiv:2311.11017, 2023

    Jiayang Liu, Siyu Zhu, Siyuan Liang, Jie Zhang, Han Fang, Weiming Zhang, and Ee- Chien Chang. Improving adversarial transferability by stable diffusion.arXiv preprint arXiv:2311.11017, 2023

  37. [37]

    Bridging the task gap: Multi-task adversarial transferability in clip and its derivatives

    Kuanrong Liu, Siyuan Liang, Cheng Qian, Ming Zhang, and Xiaochun Cao. Bridging the task gap: Multi-task adversarial transferability in clip and its derivatives. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 152–166. Springer, 2025

  38. [38]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. InInternational Conference on Learning Representations, 2023

  39. [39]

    Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for llms.arXiv preprint arXiv:2603.24511, 2026

    Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko. Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for llms.arXiv preprint arXiv:2603.24511, 2026

  40. [40]

    How hard is it to confuse a world model?arXiv preprint arXiv:2510.21232, 2025

    Waris Radji and Odalric-Ambrym Maillard. How hard is it to confuse a world model?arXiv preprint arXiv:2510.21232, 2025

  41. [41]

    Uncertainty-aware latent safety filters for avoiding out-of-distribution failures.arXiv preprint arXiv:2505.00779, 2025

    Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures.arXiv preprint arXiv:2505.00779, 2025

  42. [42]

    Learning latent dynamic robust representations for world models.arXiv preprint arXiv:2405.06263, 2024

    Ruixiang Sun, Hongyu Zang, Xin Li, and Riashat Islam. Learning latent dynamic robust representations for world models.arXiv preprint arXiv:2405.06263, 2024. 12

  43. [43]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite.arXiv preprint arXiv:1801.00690, 2018

  44. [44]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. URL https: //arxiv.org/abs/2412.15115

  45. [45]

    de Melo, and Achuta Kadambi

    Rishi Upadhyay, Howard Zhang, Jim Solomon, Ayush Agrawal, Pranay Boreddy, Shruti Satya Narayana, Yunhao Ba, Alex Wong, Celso M. de Melo, and Achuta Kadambi. World- Bench: Disambiguating physics for diagnostic evaluation of world models.arXiv preprint arXiv:2601.21282, 2026

  46. [46]

    Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan

    Keyon Vafa, Justin Y . Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model. InAdvances in Neural Information Processing Systems, volume 37, 2024

  47. [47]

    Black-box adversarial attack on vision language models for autonomous driving.arXiv preprint arXiv:2501.13563, 2025

    Lu Wang, Tianyuan Zhang, Yang Qu, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu, and Dacheng Tao. Black-box adversarial attack on vision language models for autonomous driving.arXiv preprint arXiv:2501.13563, 2025

  48. [48]

    Text adversarial attacks with dynamic outputs.arXiv preprint arXiv:2509.22393, 2025

    Wenqiang Wang, Siyuan Liang, Xiao Yan, and Xiaochun Cao. Text adversarial attacks with dynamic outputs.arXiv preprint arXiv:2509.22393, 2025

  49. [49]

    Diversifying the high-level features for better adversarial transferability.arXiv preprint arXiv:2304.10136, 2023

    Zhiyuan Wang, Zeliang Zhang, Siyuan Liang, and Xiaosen Wang. Diversifying the high-level features for better adversarial transferability.arXiv preprint arXiv:2304.10136, 2023

  50. [50]

    Transferable Adversarial Attacks for Image and Video Object Detection

    Xingxing Wei, Siyuan Liang, Ning Chen, and Xiaochun Cao. Transferable adversarial attacks for image and video object detection.arXiv preprint arXiv:1811.12641, 2018

  51. [51]

    Ctrlattack: A unified attack on world-model control in diffusion models.arXiv preprint arXiv:2603.13435, 2026

    Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Han Hu, Lefei Zhang, and Dacheng Tao. Ctrlattack: A unified attack on world-model control in diffusion models.arXiv preprint arXiv:2603.13435, 2026

  52. [52]

    Learning interactive real-world simulators

    Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations, 2024

  53. [53]

    Towards robust model-based reinforce- ment learning against adversarial corruption.arXiv preprint arXiv:2402.08991, 2024

    Chenlu Ye, Jiafan He, Quanquan Gu, and Tong Zhang. Towards robust model-based reinforce- ment learning against adversarial corruption.arXiv preprint arXiv:2402.08991, 2024

  54. [54]

    Learning invariant representations for reinforcement learning without reconstruction

    Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. InInternational Conference on Learning Representations (ICLR), 2021

  55. [55]

    Safe deep model-based reinforcement learning with lyapunov functions.arXiv preprint arXiv:2405.16184, 2024

    Harry Zhang. Safe deep model-based reinforcement learning with lyapunov functions.arXiv preprint arXiv:2405.16184, 2024

  56. [56]

    Robust deep reinforcement learning against adversarial perturbations on state observations

    Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. InAdvances in Neural Information Processing Systems, 2020

  57. [57]

    Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

    Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, and Xianglong Liu. Visual adversarial attack on vision-language models for autonomous driving.arXiv preprint arXiv:2411.18275, 2024

  58. [58]

    STORM: Efficient stochastic transformer based world models for reinforcement learning.arXiv preprint arXiv:2310.09615, 2024

    Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. STORM: Efficient stochastic transformer based world models for reinforcement learning.arXiv preprint arXiv:2310.09615, 2024

  59. [59]

    World model robustness via surprise recognition.arXiv preprint arXiv:2512.01119, 2025

    Geigh Zollicoffer, Tanush Chopra, Mingkuan Yan, Xiaoxu Ma, Kenneth Eaton, and Mark Riedl. World model robustness via surprise recognition.arXiv preprint arXiv:2512.01119, 2025. 13 A Victim and Threat Model Details This appendix details the closed-loop evaluation process under clean and attacked observations. The victim is a fixed world-model agent Mθ = (ϕ...