Pith · machine review for the scientific record

arXiv:2604.01346 · v2 · submitted 2026-04-01 · 💻 cs.CR · cs.AI · cs.LG · cs.RO

Recognition: 3 Lean theorem links

Safety, Security, and Cognitive Risks in World Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:31 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG · cs.RO
keywords world models · trajectory persistence · representational risk · attacker taxonomy · adversarial attacks · AI safety · autonomous agents · rollout errors

The pith

World models introduce trajectory persistence and representational risks that let adversaries degrade safety-critical AI agents through data corruption and rollout errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that world models, which predict future states in latent space for efficient planning, create new attack surfaces where corrupted training data or poisoned representations can compound into large performance drops. It defines trajectory persistence as the property that allows adversarial perturbations to persist across rollouts and representational risk as the vulnerability of the compressed internal state. A five-profile attacker taxonomy unifies threats from training-time poisoning to inference-time exploitation and cognitive effects like automation bias. The work demonstrates these effects empirically on RSSM and DreamerV3 models and concludes that world models therefore require the engineering discipline applied to flight-control or medical software.
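The compounding mechanism behind trajectory persistence can be illustrated with a toy latent recurrence. This is an editorial sketch under simplifying assumptions (a scalar linear transition standing in for a learned RSSM), not the paper's model:

```python
# Toy illustration of trajectory persistence: a small perturbation injected
# into the latent state at t=0 compounds across an imagined rollout instead
# of washing out. The linear dynamics here are an illustrative stand-in for
# a learned RSSM transition, not the paper's code.

def rollout(z0: float, steps: int, a: float = 1.05) -> list[float]:
    """Unroll a scalar latent z_{t+1} = a * z_t; a > 1 amplifies errors."""
    zs = [z0]
    for _ in range(steps):
        zs.append(a * zs[-1])
    return zs

clean = rollout(1.0, 20)
poisoned = rollout(1.0 + 0.01, 20)  # 1% latent perturbation at t=0
drift = [abs(p - c) for p, c in zip(clean, poisoned)]
# drift grows geometrically (0.01 * 1.05**t): the perturbation persists
# and compounds across the rollout rather than decaying.
```

With a contractive transition (a < 1) the same perturbation would decay, which is the architecture dependence the paper's stochastic-proxy result gestures at.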

Core claim

World models require the same rigour as flight-control software or medical devices. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments, and world models also enable goal misgeneralisation, deceptive alignment, and human miscalibration of trust.

What carries the argument

Formal definitions of trajectory persistence and representational risk, together with a five-profile attacker taxonomy unified under MITRE ATLAS and OWASP LLM Top 10.

If this is right

  • Adversarial fine-tuning on GRU-based RSSMs produces 2.26 times amplification of attack effects and 59.5 percent reward reduction.
  • Stochastic RSSM proxies exhibit lower attack amplification at 0.65 times, indicating architecture dependence.
  • Real DreamerV3 checkpoints already exhibit non-zero action drift under the same attack patterns.
  • Alignment-layer risks such as goal misgeneralisation and reward hacking become more feasible once persistent trajectories are available.
  • Human operators face automation bias and planning hallucination when relying on authoritative world-model predictions.
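The headline numbers above are ratios over paired clean and attacked runs. A minimal sketch of how such metrics could be computed; the drift inputs and the exact definition of amplification are assumptions for illustration, not the paper's code:

```python
# Hypothetical sketch: an attack-amplification factor and a reward reduction
# computed from paired clean vs. attacked rollouts. The input numbers are
# placeholders chosen to reproduce the paper's headline figures.

def amplification(clean_drift: float, attacked_drift: float) -> float:
    """Ratio of trajectory drift under attack to drift on clean rollouts."""
    return attacked_drift / clean_drift

def reward_reduction(clean_reward: float, attacked_reward: float) -> float:
    """Fractional drop in episode reward caused by the attack."""
    return (clean_reward - attacked_reward) / abs(clean_reward)

a1 = amplification(clean_drift=0.31, attacked_drift=0.70)           # ~2.26x
drop = reward_reduction(clean_reward=200.0, attacked_reward=81.0)   # 0.595
```

On this definition, an amplification below 1 (the 0.65x stochastic-proxy result) means the attack's effect is damped rather than compounded.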

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing safety benchmarks that test only final actions may miss the compounding effects that arise specifically from persistent world-model errors.
  • The same taxonomy could be applied to test whether language-model world models in agentic systems inherit similar persistence properties.
  • Governance frameworks such as the EU AI Act may need explicit clauses for latent-state integrity in addition to output safety.
  • Interdisciplinary teams combining adversarial ML and control-theory verification could develop quantitative bounds on acceptable rollout error.

Load-bearing premise

The introduced formal definitions of trajectory persistence and representational risk, along with the five-profile attacker taxonomy, accurately and comprehensively capture the distinctive risks in world model-equipped agents.

What would settle it

A controlled experiment on a deployed world-model agent that shows no measurable increase in trajectory drift or reward degradation after systematic attempts to poison the training set and inject rollout perturbations.

Figures

Figures reproduced from arXiv: 2604.01346 by Manoj Parmar.

Figure 1. Trajectory-Persistent Adversarial Attack Experiment (V3 Results): Core Results. (A)

Figure 2. Trajectory-Persistent Attack Experiment: Mitigation and Reward-Gap Results. (D)
Original abstract

World models - learned internal simulators of environment dynamics - are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. By predicting future states in compressed latent spaces, they enable sample-efficient planning and long-horizon imagination without direct environment interaction. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments. At the alignment layer, world model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking. At the human layer, authoritative world model predictions foster automation bias, miscalibrated trust, and planning hallucination. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker taxonomy; and develops a unified threat model drawing on MITRE ATLAS and the OWASP LLM Top 10. We provide an empirical proof-of-concept demonstrating trajectory-persistent adversarial attacks on a GRU-based RSSM ($\mathcal{A}_1 = 2.26\times$ amplification, $-59.5\%$ reward reduction under adversarial fine-tuning), validate architecture-dependence via a stochastic RSSM proxy ($\mathcal{A}_1 = 0.65\times$), and probe a real DreamerV3 checkpoint (non-zero action drift confirmed). We propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design, arguing that world models require the same rigour as flight-control software or medical devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper surveys the landscape of world models in autonomous agents, introduces formal definitions of trajectory persistence and representational risk, presents a five-profile attacker taxonomy, and develops a unified threat model integrating MITRE ATLAS and OWASP LLM Top 10. It reports an empirical proof-of-concept on GRU-based RSSM showing 2.26x amplification and -59.5% reward reduction under adversarial fine-tuning, with validation on a stochastic RSSM proxy (0.65x) and a DreamerV3 checkpoint confirming non-zero action drift. The central argument is that these models create distinctive safety, security, and cognitive risks requiring rigour comparable to flight-control software or medical devices, with proposed mitigations spanning adversarial hardening, alignment, and governance.

Significance. If the formal definitions and taxonomy are shown to isolate vulnerabilities beyond standard RL error metrics and existing adversarial frameworks, the work would establish a necessary foundation for treating world models as high-stakes components in robotics and agentic systems. The empirical POC provides concrete evidence of degradation under poisoning and rollout attacks, supporting calls for interdisciplinary safeguards aligned with NIST AI RMF and EU AI Act.

major comments (3)
  1. [formal definitions section] Section introducing formal definitions of trajectory persistence and representational risk: the manuscript does not include an ablation demonstrating that these constructs predict degradation better than baseline rollout variance or compounding error norms; without this, the claim that they capture distinctive risks (beyond reframing known poisoning/rollout attacks) remains unestablished and load-bearing for the central argument.
  2. [empirical POC section] Empirical proof-of-concept section reporting GRU-RSSM results (A1 = 2.26x amplification, -59.5% reward reduction): the abstract and results lack methodology details including data exclusion rules, error bars, statistical tests, and verification procedures for the GRU-based RSSM and DreamerV3 experiments, preventing assessment of whether the observed effects exceed standard adversarial RL baselines.
  3. [attacker taxonomy and threat model section] Section on the five-profile attacker taxonomy and unified threat model: the taxonomy is not shown to exhaust or extend vectors already covered by MITRE ATLAS; an explicit mapping or gap analysis is required to substantiate that the framework adds distinctive coverage rather than overlapping with existing adversarial RL techniques.
minor comments (2)
  1. [abstract] The abstract reports precise numerical results (e.g., 2.26x, -59.5%) without accompanying confidence intervals or baseline comparisons, which reduces clarity for readers assessing the magnitude of effects.
  2. [empirical results] Notation for A1 amplification and reward reduction should be defined explicitly on first use with reference to the underlying equations for trajectory persistence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the work.

Point-by-point responses
  1. Referee: [formal definitions section] Section introducing formal definitions of trajectory persistence and representational risk: the manuscript does not include an ablation demonstrating that these constructs predict degradation better than baseline rollout variance or compounding error norms; without this, the claim that they capture distinctive risks (beyond reframing known poisoning/rollout attacks) remains unestablished and load-bearing for the central argument.

    Authors: We agree that an ablation would provide stronger empirical grounding for the claim that trajectory persistence and representational risk isolate vulnerabilities beyond standard RL metrics. The manuscript is structured as a survey with targeted proof-of-concept demonstrations rather than a full comparative empirical study. In the revision we will add a dedicated discussion subsection that (i) formally contrasts the new constructs with rollout variance and compounding error norms and (ii) reports a limited post-hoc comparison using the existing GRU-RSSM and DreamerV3 data to illustrate where the new metrics diverge. A comprehensive ablation study is beyond the current scope but will be noted as future work. revision: partial

  2. Referee: [empirical POC section] Empirical proof-of-concept section reporting GRU-RSSM results (A1 = 2.26x amplification, -59.5% reward reduction): the abstract and results lack methodology details including data exclusion rules, error bars, statistical tests, and verification procedures for the GRU-based RSSM and DreamerV3 experiments, preventing assessment of whether the observed effects exceed standard adversarial RL baselines.

    Authors: We accept that the current presentation of the empirical results is insufficiently detailed for reproducibility and baseline comparison. The revised manuscript will expand the empirical section to include: data exclusion criteria, error bars with standard deviations across repeated runs, statistical significance tests, and explicit verification procedures for both the GRU-based RSSM and DreamerV3 checkpoint experiments. These additions will allow readers to evaluate the results against standard adversarial RL baselines. revision: yes
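The error bars and statistical tests the referee asks for could take the form of a bootstrap confidence interval over per-seed reward reductions. A hedged sketch of the procedure; the run values are placeholders, only the method is illustrated:

```python
# Editorial sketch: bootstrap 95% confidence interval over reward reductions
# from repeated runs. The per-seed values below are invented placeholders,
# not results from the paper.
import random

def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `samples`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(samples, k=len(samples))) / len(samples)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

runs = [0.57, 0.61, 0.59, 0.63, 0.55, 0.62]  # placeholder per-seed reward drops
lo, hi = bootstrap_ci(runs)
```

Reporting the interval alongside a clean-baseline interval would let readers judge whether the attacked-vs-clean gap exceeds run-to-run noise.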

  3. Referee: [attacker taxonomy and threat model section] Section on the five-profile attacker taxonomy and unified threat model: the taxonomy is not shown to exhaust or extend vectors already covered by MITRE ATLAS; an explicit mapping or gap analysis is required to substantiate that the framework adds distinctive coverage rather than overlapping with existing adversarial RL techniques.

    Authors: We agree that an explicit mapping is required to substantiate the added value of the taxonomy. The revised manuscript will include a new table that maps each of the five attacker profiles to the relevant MITRE ATLAS techniques, together with a gap analysis. The analysis will highlight extensions in the areas of cognitive risk, goal misgeneralization, and human-factor biases that are not the primary focus of existing adversarial RL frameworks, thereby clarifying how the unified threat model integrates and extends prior work. revision: yes

Circularity Check

0 steps flagged

New formal definitions and taxonomy introduced as independent constructs; no reduction to self-referential inputs

Full rationale

The paper introduces formal definitions of trajectory persistence and representational risk plus a five-profile attacker taxonomy as novel constructs, then applies them to survey risks and report empirical POC on GRU-RSSM and DreamerV3. These definitions are presented as new rather than derived from prior fitted parameters or self-citations that reduce to the target claims. The unified threat model explicitly draws on external sources (MITRE ATLAS, OWASP LLM Top 10) and NIST/EU frameworks. No equations, fitted inputs, or self-citation chains are shown to force the central claim by construction; the empirical results (A1 amplification, reward reduction) stand as separate validation. This yields a minor self-citation load at most (score 2) while remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are identifiable. The new definitions of trajectory persistence and representational risk function as introduced concepts whose grounding cannot be verified without full text.

pith-pipeline@v0.9.0 · 5588 in / 1238 out tokens · 99995 ms · 2026-05-13T22:31:52.596969+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 6 internal anchors

  1. [1]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. URL https: //arxiv.org/abs/1803.10122

  2. [2]

    Mastering diverse control tasks through world models.Nature, 640:647–653, 2025

    Danijar Hafner, Jurgis Pašukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640:647–653, 2025. URL https://www.nature.com/articles/ s41586-025-08744-2

  3. [3]

    A path towards autonomous machine intelligence

    Yann LeCun. A path towards autonomous machine intelligence. OpenReview, 2022. URL https: //openreview.net/pdf?id=BZ5a1r-kVsf

  4. [4]

    Model-based imitation learning for urban driving

    Anthony Hu, Gianluca Corrado, Nicolas Griffiths, et al. Model-based imitation learning for urban driving. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/ 2210.07729

  5. [5]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, et al. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. URLhttps://arxiv.org/abs/2309.17080

  6. [6]

    DriveDreamer: Towards real-world-driven world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, et al. DriveDreamer: Towards real-world-driven world models for autonomous driving. InEuropean Conference on Computer Vision (ECCV), 2024. URL https://arxiv. org/abs/2309.09777

  7. [7]

    arXiv preprint arXiv:2310.061141(2), 6 (2023)

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, et al. Learning interactive real-world simulators. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.06114

  8. [8]

    Robotic world model: A neural network simulator for robust policy optimization in robotics.arXiv preprint arXiv:2501.10100, 2025

    Chenhao Li, Andreas Krause, and Marco Hutter. Robotic world model: A neural network simulator for robust policy optimization in robotics.arXiv preprint arXiv:2501.10100, 2025. URL https://arxiv.org/abs/ 2501.10100

  9. [9]

    Genie: Generative interactive environments, 2024

    Jake Bruce, Michael Dennis, Ashley Edwards, et al. Genie: Generative interactive environments.arXiv preprint arXiv:2402.15391, 2024. URLhttps://arxiv.org/abs/2402.15391

  10. [10]

    Understanding world or predicting future? A comprehensive survey of world models.arXiv preprint arXiv:2411.14499, 2024

    Jingtao Ding, Yunke Zhang, Yu Shang, Jie Feng, Yuheng Zhang, Zefang Zong, Yuan Yuan, Hongyuan Su, Nian Li, Jinghua Piao, Yucheng Deng, Nicholas Sukiennik, Chen Gao, Fengli Xu, and Yong Li. Understanding world or predicting future? A comprehensive survey of world models.arXiv preprint arXiv:2411.14499, 2024. URL https://arxiv.org/abs/2411.14499

  11. [11]

    A comprehensive survey on world models for embodied AI.arXiv preprintarXiv:2510.16732, 2025

    Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied AI.arXiv preprint arXiv:2510.16732, 2024. URLhttps://arxiv.org/abs/2510.16732

  12. [12]

    Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models

    Nan Jiang, Alex Kulesza, and Satinder Singh. Hallucinating value: A pitfall of Dyna-style planning with imperfect world models.arXiv preprint arXiv:2006.04363, 2020. URLhttps://arxiv.org/abs/2006.04363

  13. [13]

    Four principles for physically interpretable world models.arXiv preprint arXiv:2503.02143, 2025

    Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. Four principles for physically interpretable world models.arXiv preprint arXiv:2503.02143, 2025. URL https://arxiv.org/abs/ 2503.02143

  14. [14]

    World models: The safety perspective.arXiv preprint arXiv:2411.07690, 2024

    Zifan Zeng, Chongzhe Zhang, Feng Liu, Joseph Sifakis, Qunli Zhang, Shiming Liu, and Peng Wang. World models: The safety perspective.arXiv preprint arXiv:2411.07690, 2024. URL https://arxiv.org/abs/ 2411.07690. 24 Safety, Security, and Cognitive Risks in World Models

  15. [15]

    Deception in reinforced autonomous agents: The unconventional rabbit hat trick in legislation.arXiv preprint arXiv:2405.04325, 2024

    Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, and Balaraman Ravindran. Deception in reinforced autonomous agents: The unconventional rabbit hat trick in legislation.arXiv preprint arXiv:2405.04325, 2024. URLhttps://arxiv.org/abs/2405.04325

  16. [16]

    Dynamic human trust modeling of autonomous agents with varying capability and strategy.arXiv preprint arXiv:2404.19291, 2024

    Jason Dekarske, Zhaodan Kong, and Sanjay Joshi. Dynamic human trust modeling of autonomous agents with varying capability and strategy.arXiv preprint arXiv:2404.19291, 2024. URL https://arxiv.org/abs/ 2404.19291

  17. [17]

    MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems

    MITRE Corporation. MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems. MITRE Corporation, 2024. URLhttps://atlas.mitre.org/

  18. [18]

    OW ASP top 10 for LLM applications

    OW ASP Foundation. OW ASP top 10 for LLM applications. OW ASP Foundation, 2025. URLhttps://owasp. org/www-project-top-10-for-large-language-model-applications/

  19. [19]

    Robust deep reinforcement learning against adversarial perturbations on state observations

    Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URLhttps://arxiv.org/abs/2003.08938

  20. [20]

    Robust deep reinforcement learning with adaptive adversarial perturbations in action space.arXiv preprint arXiv:2405.11982, 2024

    Qianmei Liu, Yufei Kuang, and Jie Wang. Robust deep reinforcement learning with adaptive adversarial perturbations in action space.arXiv preprint arXiv:2405.11982, 2024. URL https://arxiv.org/abs/ 2405.11982

  21. [21]

    When world models dream wrong: Physical-conditioned adversarial attacks against world models.arXiv preprint arXiv:2602.18739, 2026

    Zhixiang Guo, Siyuan Liang, Andras Balogh, Noah Lunberry, Rong-Cheng Tu, Mark Jelasity, and Dacheng Tao. When world models dream wrong: Physical-conditioned adversarial attacks against world models.arXiv preprint arXiv:2602.18739, 2026. URLhttps://arxiv.org/abs/2602.18739

  22. [22]

    Explaining and Harnessing Adversarial Examples

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (ICLR 2015), 2015. URLhttps://arxiv.org/abs/1412.6572

  23. [23]

    Deep learning adversarial attacks and defenses in autonomous vehicles

    Fawzi Boumazouza et al. Deep learning adversarial attacks and defenses in autonomous vehicles. Artificial Intelligence Review, 2024. URL https://link.springer.com/article/10.1007/ s10462-024-11014-8

  24. [24]

    Data poisoning in deep learning: A survey.arXiv preprint arXiv:2503.22759, 2025

    Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, and Ou Wu. Data poisoning in deep learning: A survey.arXiv preprint arXiv:2503.22759, 2025. URLhttps://arxiv.org/abs/2503.22759

  25. [25]

    arXiv preprint arXiv:1906.01820 , year =

    Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems.arXiv preprint arXiv:1906.01820, 2019. URL https: //arxiv.org/abs/1906.01820

  26. [26]

    Sharkey, Jacob Pfau, and David Krueger

    Lauro Langosco di Langosco, Jack Koch, Lee D. Sharkey, Jacob Pfau, and David Krueger. Goal misgeneralization in deep reinforcement learning. InProceedings of the 39th International Conference on Machine Learning (ICML),

  27. [27]

    URLhttps://arxiv.org/abs/2105.14111

  28. [28]

    Specification gaming: The flip side of AI ingenuity

    Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Ku- mar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020. URL https://deepmind.google/discover/blog/ specification-gaming-the-flip-side-of-ai-ingenuity/

  29. [29]

    The alignment problem from a deep learning perspective

    Richard Ngo, Lawrence Chan, and Soren Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022. URLhttps://arxiv.org/abs/2209.00626

  30. [30]

    SafeDreamer: Safe reinforcement learning with world models

    Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. SafeDreamer: Safe reinforcement learning with world models. InInternational Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.07176. 25 Safety, Security, and Cognitive Risks in World Models

  31. [31]

    Mopo: Model-Based Offline Policy Optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URLhttps://arxiv.org/abs/2005.13239

  32. [32]

    MOReL: Model-based offline reinforcement learning

    Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2005.05951

  33. [33]

    COMBO: Conservative offline model-based policy optimization

    Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. URLhttps://arxiv.org/abs/2102.08363

  34. [34]

    Deep reinforcement learning policies learn shared adversarial features across MDPs

    Ezgi Korkmaz. Deep reinforcement learning policies learn shared adversarial features across MDPs. InProceedings of the AAAI Conference on Artificial Intelligence, 2022

  35. [35]

    Adversarial robust deep reinforcement learning requires redefining robustness

    Ezgi Korkmaz. Adversarial robust deep reinforcement learning requires redefining robustness. InProceedings of the AAAI Conference on Artificial Intelligence, 2023

  36. [36]

    Detecting adversarial directions in deep reinforcement learning to make robust decisions

    Ezgi Korkmaz et al. Detecting adversarial directions in deep reinforcement learning to make robust decisions. In Proceedings of the International Conference on Machine Learning, 2023

  37. [37]

    Understanding and diagnosing deep reinforcement learning

    Ezgi Korkmaz. Understanding and diagnosing deep reinforcement learning. InProceedings of the International Conference on Machine Learning, 2024

  38. [38]

    How to lose inherent counterfactuality in reinforcement learning

    Ezgi Korkmaz. How to lose inherent counterfactuality in reinforcement learning. InInternational Conference on Learning Representations, 2026

  39. [39]

    Zico Kolter and Eric Wong

    J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. InProceedings of the International Conference on Machine Learning, 2018

  40. [40]

    Certified robustness to adversarial examples with differential privacy

    Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. InIEEE Symposium on Security and Privacy, 2019

  41. [41]

    Zico Kolter

    Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the International Conference on Machine Learning, 2019

  42. [42]

    Provably robust deep learning via adversarially trained smoothed classifiers

    Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. InAdvances in Neural Information Processing Systems, 2019

  43. [43]

    Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li

    Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. InProceedings of the International Conference on Machine Learning, 2020

  44. [44]

    Towards realistic guarantees: A probabilistic certificate for Smooth- LLM, 2025

    Adarsh Kumarappan and Ayushi Mehrotra. Towards realistic guarantees: A probabilistic certificate for Smooth- LLM, 2025

  45. [45]

    Humans and automation: Use, misuse, disuse, abuse

    Raja Parasuraman and Victor Riley. Humans and automation: Use, misuse, disuse, abuse.Human Factors, 39(2): 230–253, 1997. URLhttps://doi.org/10.1518/001872097778543886

  46. [46]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2:665–673,

  47. [47]

    URLhttps://doi.org/10.1038/s42256-020-00257-z

  48. [48]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. InAdvances in Neural Information Processing Systems (NeurIPS), 2018. URLhttps://arxiv.org/abs/1809.01999. 26 Safety, Security, and Cognitive Risks in World Models

  49. [49]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. URL https://arxiv.org/abs/1912. 01603

  50. [50]

    Sora: Creating video from text

    OpenAI. Sora: Creating video from text. OpenAI Technical Report, 2024. URL https://openai.com/ index/sora/

  51. [51]

    LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions.arXiv preprint arXiv:2509.18970, 2025

    Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Pengfei Cao, Lixin Zou, Xu Chen, Chuan Zhou, Jia Wu, Shirui Pan, Bin Wang, Yanan Cao, Kai Chen, Songlin Hu, and Li Guo. LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions.arXiv preprint arXiv:2509.18970, 202...

  52. [52]

    The foundry problem: World models and the missing liabil- ity framework for self-supervised learning

    Stanford CodeX. The foundry problem: World models and the missing liabil- ity framework for self-supervised learning. Stanford Center for Legal Informat- ics (CodeX) Blog, 2026. URL https://law.stanford.edu/2026/03/06/ the-foundry-problem-world-models-and-the-missing-liability-framework-for-self-supervised-learning/

  53. [53]

    Poisoning attacks against machine learning

    NIST. Poisoning attacks against machine learning. NIST Technical Report, 2022. URL https://tsapps. nist.gov/publication/get_pdf.cfm?pub_id=934932

  54. [54]

    Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

    Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.arXiv preprint arXiv:2502.05206, 2025. URLhttps://arxiv.org/abs/2502.05206

  55. [55]

    Automation bias in human-AI collaboration: A review.AI & Society, 2025

    Bochao Zou et al. Automation bias in human-AI collaboration: A review.AI & Society, 2025. URL https: //link.springer.com/article/10.1007/s00146-025-02422-7

  56. [56]

    Robust multi-agent reinforcement learning against adversarial attacks for cooperative self- driving vehicles.IET Radar, Sonar & Navigation, 2025

    Guoxin Wang et al. Robust multi-agent reinforcement learning against adversarial attacks for cooperative self- driving vehicles.IET Radar, Sonar & Navigation, 2025. URL https://ietresearch.onlinelibrary. wiley.com/doi/10.1049/rsn2.70033

  57. [57]

    Talvitie, Michael Bowling, and Martha White

    Farzane Aminmansour, Taher Jafferjee, Ehsan Imani, Erin J. Talvitie, Michael Bowling, and Martha White. Mitigating value hallucination in Dyna-style planning via multistep predecessor models.Journal of Artificial In- telligence Research, 80:441–473, 2024. URL https://www.jair.org/index.php/jair/article/ view/15155

  58. [58]

    Sinan Zeng et al. An empirical study on hallucinations in embodied agents. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. URL https://aclanthology.org/2025.findings-emnlp.1158.pdf

  59. [59]

    Katie Kang, Paula Gradu, Jason Choi, Michael Janner, Claire Tomlin, and Sergey Levine. Lyapunov density models: Constraining distribution shift in learning-based control. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022. URL https://arxiv.org/abs/2206.10524

  60. [60]

    Eric Jing and Abdeslam Boularias. Bounding distributional shifts in world modeling through novelty detection. arXiv preprint arXiv:2508.06096, 2025. URL https://arxiv.org/abs/2508.06096

  61. [61]

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025. URL https://arxiv.org/abs/2501.11260

  62. [62]

    Jingkang Wang, Ava Pun, James Tu, et al. AdvSim: Generating safety-critical scenarios for self-driving vehicles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. URL https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_AdvSim_Generating_Safety-Critical_Scenarios_for_Self-Driving_Vehicles_CVPR_2021_paper.pdf

  63. [63]

    Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, and Yushun Dong. A survey on model extraction attacks and defenses for large language models. arXiv preprint arXiv:2506.22521, 2025. URL https://arxiv.org/abs/2506.22521

  64. [64]

    Michael Veale, Reuben Binns, and Lilian Edwards. Algorithms that remember: Model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A, 376(2133):20180083, 2018. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC6191664/

  65. [65]

    Hao Fang, Yixiang Qiu, Hongyao Yu, Wenbo Yu, Jiawei Kong, Baoli Chong, Bin Chen, Xuan Wang, Shu-Tao Xia, and Ke Xu. Privacy leakage on DNNs: A survey of model inversion attacks and defenses. arXiv preprint arXiv:2402.04013, 2024. URL https://arxiv.org/abs/2402.04013

  66. [66]

    Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, et al. Alignment and safety in large language models: Safety mechanisms, training paradigms, and emerging challenges. arXiv preprint arXiv:2507.19672, 2025. URL https://arxiv.org/abs/2507.19672

  67. [67]

    Max Hellrigel-Holderbaum and Leonard Dung. Misalignment or misuse? The AGI alignment tradeoff. arXiv preprint arXiv:2506.03755, 2025. URL https://arxiv.org/abs/2506.03755

  68. [68]

    Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, and Usman Naseem. A survey on progress in LLM alignment from the perspective of reward design. arXiv preprint arXiv:2505.02666, 2025. URL https://arxiv.org/abs/2505.02666

  69. [69]

    Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, et al. Current agents fail to leverage world model as tool for foresight. arXiv preprint arXiv:2601.03905, 2026. URL https://arxiv.org/abs/2601.03905

  70. [70]

    Yahya Aalaila, Gerrit Großmann, Sumantrak Mukherjee, Jonas Wahl, and Sebastian Vollmer. When counterfactual reasoning fails: Chaos and real-world complexity. arXiv preprint arXiv:2503.23820, 2025. URL https://arxiv.org/abs/2503.23820

  71. [71]

    Hayley Clatterbuck, Clinton Castro, and Arvo Muñoz Morán. Risk alignment in agentic AI systems. arXiv preprint arXiv:2410.01927, 2024. URL https://arxiv.org/abs/2410.01927

  72. [72]

    National Institute of Standards and Technology. Artificial intelligence risk management framework (AI RMF 1.0), NIST AI 100-1. Technical report, NIST, 2023. URL https://doi.org/10.6028/NIST.AI.100-1

  73. [73]

    National Institute of Standards and Technology. Artificial intelligence risk management framework: Generative artificial intelligence profile (NIST AI 600-1). Technical report, NIST, 2024. URL https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

  74. [74]

    European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council — Artificial Intelligence Act. Official Journal of the European Union, 2024. URL https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

A Supplementary Figures

    [Figure: x-axis "Rollout step k" (5–30); y-axis values 0.00–0.03; remaining plot content lost in extraction]