Pith · machine review for the scientific record

arXiv:2604.01346 · v2 · submitted 2026-04-01 · 💻 cs.CR · cs.AI · cs.LG · cs.RO

Recognition: 3 Lean theorem links

Safety, Security, and Cognitive Risks in World Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:31 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG · cs.RO
keywords world models · trajectory persistence · representational risk · attacker taxonomy · adversarial attacks · AI safety · autonomous agents · rollout errors

The pith

World models introduce trajectory persistence and representational risks that let adversaries degrade safety-critical AI agents through data corruption and rollout errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that world models, which predict future states in latent space for efficient planning, create new attack surfaces where corrupted training data or poisoned representations can compound into large performance drops. It defines trajectory persistence as the property that allows adversarial perturbations to persist across rollouts and representational risk as the vulnerability of the compressed internal state. A five-profile attacker taxonomy unifies threats from training-time poisoning to inference-time exploitation and cognitive effects like automation bias. The work demonstrates these effects empirically on RSSM and DreamerV3 models and concludes that world models therefore require the engineering discipline applied to flight-control or medical software.
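The compounding mechanism behind trajectory persistence can be illustrated with a toy latent recurrence. This is an editorial sketch under simplifying assumptions (a scalar linear transition standing in for a learned RSSM), not the paper's model:

```python
# Toy illustration of trajectory persistence: a small perturbation injected
# into the latent state at t=0 compounds across an imagined rollout instead
# of washing out. The linear dynamics here are an illustrative stand-in for
# a learned RSSM transition, not the paper's code.

def rollout(z0: float, steps: int, a: float = 1.05) -> list[float]:
    """Unroll a scalar latent z_{t+1} = a * z_t; a > 1 amplifies errors."""
    zs = [z0]
    for _ in range(steps):
        zs.append(a * zs[-1])
    return zs

clean = rollout(1.0, 20)
poisoned = rollout(1.0 + 0.01, 20)  # 1% latent perturbation at t=0
drift = [abs(p - c) for p, c in zip(clean, poisoned)]
# drift grows geometrically (0.01 * 1.05**t): the perturbation persists
# and compounds across the rollout rather than decaying.
```

With a contractive transition (a < 1) the same perturbation would decay, which is the architecture dependence the paper's stochastic-proxy result gestures at.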

Core claim

World models require the same rigour as flight-control software or medical devices. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments, and world models also enable goal misgeneralisation, deceptive alignment, and human miscalibration of trust.

What carries the argument

Formal definitions of trajectory persistence and representational risk, together with a five-profile attacker taxonomy unified under MITRE ATLAS and OWASP LLM Top 10.

If this is right

  • Adversarial fine-tuning on GRU-based RSSMs produces 2.26 times amplification of attack effects and 59.5 percent reward reduction.
  • Stochastic RSSM proxies exhibit lower attack amplification at 0.65 times, indicating architecture dependence.
  • Real DreamerV3 checkpoints already exhibit non-zero action drift under the same attack patterns.
  • Alignment-layer risks such as goal misgeneralisation and reward hacking become more feasible once persistent trajectories are available.
  • Human operators face automation bias and planning hallucination when relying on authoritative world-model predictions.
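The headline numbers above are ratios over paired clean and attacked runs. A minimal sketch of how such metrics could be computed; the drift inputs and the exact definition of amplification are assumptions for illustration, not the paper's code:

```python
# Hypothetical sketch: an attack-amplification factor and a reward reduction
# computed from paired clean vs. attacked rollouts. The input numbers are
# placeholders chosen to reproduce the paper's headline figures.

def amplification(clean_drift: float, attacked_drift: float) -> float:
    """Ratio of trajectory drift under attack to drift on clean rollouts."""
    return attacked_drift / clean_drift

def reward_reduction(clean_reward: float, attacked_reward: float) -> float:
    """Fractional drop in episode reward caused by the attack."""
    return (clean_reward - attacked_reward) / abs(clean_reward)

a1 = amplification(clean_drift=0.31, attacked_drift=0.70)           # ~2.26x
drop = reward_reduction(clean_reward=200.0, attacked_reward=81.0)   # 0.595
```

On this definition, an amplification below 1 (the 0.65x stochastic-proxy result) means the attack's effect is damped rather than compounded.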

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing safety benchmarks that test only final actions may miss the compounding effects that arise specifically from persistent world-model errors.
  • The same taxonomy could be applied to test whether language-model world models in agentic systems inherit similar persistence properties.
  • Governance frameworks such as the EU AI Act may need explicit clauses for latent-state integrity in addition to output safety.
  • Interdisciplinary teams combining adversarial ML and control-theory verification could develop quantitative bounds on acceptable rollout error.

Load-bearing premise

The introduced formal definitions of trajectory persistence and representational risk, along with the five-profile attacker taxonomy, accurately and comprehensively capture the distinctive risks in world model-equipped agents.

What would settle it

A controlled experiment on a deployed world-model agent that shows no measurable increase in trajectory drift or reward degradation after systematic attempts to poison the training set and inject rollout perturbations.

Figures

Figures reproduced from arXiv: 2604.01346 by Manoj Parmar.

Figure 1. Trajectory-Persistent Adversarial Attack Experiment (V3 Results): Core Results. (A)

Figure 2. Trajectory-Persistent Attack Experiment: Mitigation and Reward-Gap Results. (D)
Original abstract

World models - learned internal simulators of environment dynamics - are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. By predicting future states in compressed latent spaces, they enable sample-efficient planning and long-horizon imagination without direct environment interaction. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments. At the alignment layer, world model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking. At the human layer, authoritative world model predictions foster automation bias, miscalibrated trust, and planning hallucination. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker taxonomy; and develops a unified threat model drawing on MITRE ATLAS and the OWASP LLM Top 10. We provide an empirical proof-of-concept demonstrating trajectory-persistent adversarial attacks on a GRU-based RSSM ($\mathcal{A}_1 = 2.26\times$ amplification, $-59.5\%$ reward reduction under adversarial fine-tuning), validate architecture-dependence via a stochastic RSSM proxy ($\mathcal{A}_1 = 0.65\times$), and probe a real DreamerV3 checkpoint (non-zero action drift confirmed). We propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design, arguing that world models require the same rigour as flight-control software or medical devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper surveys the landscape of world models in autonomous agents, introduces formal definitions of trajectory persistence and representational risk, presents a five-profile attacker taxonomy, and develops a unified threat model integrating MITRE ATLAS and OWASP LLM Top 10. It reports an empirical proof-of-concept on GRU-based RSSM showing 2.26x amplification and -59.5% reward reduction under adversarial fine-tuning, with validation on a stochastic RSSM proxy (0.65x) and a DreamerV3 checkpoint confirming non-zero action drift. The central argument is that these models create distinctive safety, security, and cognitive risks requiring rigour comparable to flight-control software or medical devices, with proposed mitigations spanning adversarial hardening, alignment, and governance.

Significance. If the formal definitions and taxonomy are shown to isolate vulnerabilities beyond standard RL error metrics and existing adversarial frameworks, the work would establish a necessary foundation for treating world models as high-stakes components in robotics and agentic systems. The empirical POC provides concrete evidence of degradation under poisoning and rollout attacks, supporting calls for interdisciplinary safeguards aligned with NIST AI RMF and EU AI Act.

major comments (3)
  1. [formal definitions section] Section introducing formal definitions of trajectory persistence and representational risk: the manuscript does not include an ablation demonstrating that these constructs predict degradation better than baseline rollout variance or compounding error norms; without this, the claim that they capture distinctive risks (beyond reframing known poisoning/rollout attacks) remains unestablished and load-bearing for the central argument.
  2. [empirical POC section] Empirical proof-of-concept section reporting GRU-RSSM results (A1 = 2.26x amplification, -59.5% reward reduction): the abstract and results lack methodology details including data exclusion rules, error bars, statistical tests, and verification procedures for the GRU-based RSSM and DreamerV3 experiments, preventing assessment of whether the observed effects exceed standard adversarial RL baselines.
  3. [attacker taxonomy and threat model section] Section on the five-profile attacker taxonomy and unified threat model: the taxonomy is not shown to exhaust or extend vectors already covered by MITRE ATLAS; an explicit mapping or gap analysis is required to substantiate that the framework adds distinctive coverage rather than overlapping with existing adversarial RL techniques.
minor comments (2)
  1. [abstract] The abstract reports precise numerical results (e.g., 2.26x, -59.5%) without accompanying confidence intervals or baseline comparisons, which reduces clarity for readers assessing the magnitude of effects.
  2. [empirical results] Notation for A1 amplification and reward reduction should be defined explicitly on first use with reference to the underlying equations for trajectory persistence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the work.

Point-by-point responses
  1. Referee: [formal definitions section] Section introducing formal definitions of trajectory persistence and representational risk: the manuscript does not include an ablation demonstrating that these constructs predict degradation better than baseline rollout variance or compounding error norms; without this, the claim that they capture distinctive risks (beyond reframing known poisoning/rollout attacks) remains unestablished and load-bearing for the central argument.

    Authors: We agree that an ablation would provide stronger empirical grounding for the claim that trajectory persistence and representational risk isolate vulnerabilities beyond standard RL metrics. The manuscript is structured as a survey with targeted proof-of-concept demonstrations rather than a full comparative empirical study. In the revision we will add a dedicated discussion subsection that (i) formally contrasts the new constructs with rollout variance and compounding error norms and (ii) reports a limited post-hoc comparison using the existing GRU-RSSM and DreamerV3 data to illustrate where the new metrics diverge. A comprehensive ablation study is beyond the current scope but will be noted as future work. revision: partial

  2. Referee: [empirical POC section] Empirical proof-of-concept section reporting GRU-RSSM results (A1 = 2.26x amplification, -59.5% reward reduction): the abstract and results lack methodology details including data exclusion rules, error bars, statistical tests, and verification procedures for the GRU-based RSSM and DreamerV3 experiments, preventing assessment of whether the observed effects exceed standard adversarial RL baselines.

    Authors: We accept that the current presentation of the empirical results is insufficiently detailed for reproducibility and baseline comparison. The revised manuscript will expand the empirical section to include: data exclusion criteria, error bars with standard deviations across repeated runs, statistical significance tests, and explicit verification procedures for both the GRU-based RSSM and DreamerV3 checkpoint experiments. These additions will allow readers to evaluate the results against standard adversarial RL baselines. revision: yes
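The error bars and statistical tests the referee asks for could take the form of a bootstrap confidence interval over per-seed reward reductions. A hedged sketch of the procedure; the run values are placeholders, only the method is illustrated:

```python
# Editorial sketch: bootstrap 95% confidence interval over reward reductions
# from repeated runs. The per-seed values below are invented placeholders,
# not results from the paper.
import random

def bootstrap_ci(samples, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `samples`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(samples, k=len(samples))) / len(samples)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

runs = [0.57, 0.61, 0.59, 0.63, 0.55, 0.62]  # placeholder per-seed reward drops
lo, hi = bootstrap_ci(runs)
```

Reporting the interval alongside a clean-baseline interval would let readers judge whether the attacked-vs-clean gap exceeds run-to-run noise.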

  3. Referee: [attacker taxonomy and threat model section] Section on the five-profile attacker taxonomy and unified threat model: the taxonomy is not shown to exhaust or extend vectors already covered by MITRE ATLAS; an explicit mapping or gap analysis is required to substantiate that the framework adds distinctive coverage rather than overlapping with existing adversarial RL techniques.

    Authors: We agree that an explicit mapping is required to substantiate the added value of the taxonomy. The revised manuscript will include a new table that maps each of the five attacker profiles to the relevant MITRE ATLAS techniques, together with a gap analysis. The analysis will highlight extensions in the areas of cognitive risk, goal misgeneralization, and human-factor biases that are not the primary focus of existing adversarial RL frameworks, thereby clarifying how the unified threat model integrates and extends prior work. revision: yes

Circularity Check

0 steps flagged

New formal definitions and taxonomy introduced as independent constructs; no reduction to self-referential inputs

Full rationale

The paper introduces formal definitions of trajectory persistence and representational risk plus a five-profile attacker taxonomy as novel constructs, then applies them to survey risks and report empirical POC on GRU-RSSM and DreamerV3. These definitions are presented as new rather than derived from prior fitted parameters or self-citations that reduce to the target claims. The unified threat model explicitly draws on external sources (MITRE ATLAS, OWASP LLM Top 10) and NIST/EU frameworks. No equations, fitted inputs, or self-citation chains are shown to force the central claim by construction; the empirical results (A1 amplification, reward reduction) stand as separate validation. This yields a minor self-citation load at most (score 2) while remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are identifiable. The new definitions of trajectory persistence and representational risk function as introduced concepts whose grounding cannot be verified without full text.

pith-pipeline@v0.9.0 · 5588 in / 1238 out tokens · 99995 ms · 2026-05-13T22:31:52.596969+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 6 internal anchors

  1. [1]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. URL https: //arxiv.org/abs/1803.10122

  2. [2]

    Mastering diverse control tasks through world models.Nature, 640:647–653, 2025

    Danijar Hafner, Jurgis Pašukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640:647–653, 2025. URL https://www.nature.com/articles/ s41586-025-08744-2

  3. [3]

    A path towards autonomous machine intelligence

    Yann LeCun. A path towards autonomous machine intelligence. OpenReview, 2022. URL https: //openreview.net/pdf?id=BZ5a1r-kVsf

  4. [4]

    Model-based imitation learning for urban driving

    Anthony Hu, Gianluca Corrado, Nicolas Griffiths, et al. Model-based imitation learning for urban driving. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/ 2210.07729

  5. [5]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, et al. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. URLhttps://arxiv.org/abs/2309.17080

  6. [6]

    DriveDreamer: Towards real-world-driven world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, et al. DriveDreamer: Towards real-world-driven world models for autonomous driving. InEuropean Conference on Computer Vision (ECCV), 2024. URL https://arxiv. org/abs/2309.09777

  7. [7]

    arXiv preprint arXiv:2310.061141(2), 6 (2023)

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, et al. Learning interactive real-world simulators. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.06114

  8. [8]

    Robotic world model: A neural network simulator for robust policy optimization in robotics.arXiv preprint arXiv:2501.10100, 2025

    Chenhao Li, Andreas Krause, and Marco Hutter. Robotic world model: A neural network simulator for robust policy optimization in robotics.arXiv preprint arXiv:2501.10100, 2025. URL https://arxiv.org/abs/ 2501.10100

  9. [9]

    Genie: Generative interactive environments, 2024

    Jake Bruce, Michael Dennis, Ashley Edwards, et al. Genie: Generative interactive environments.arXiv preprint arXiv:2402.15391, 2024. URLhttps://arxiv.org/abs/2402.15391

  10. [10]

    Understanding world or predicting future? A comprehensive survey of world models.arXiv preprint arXiv:2411.14499, 2024

    Jingtao Ding, Yunke Zhang, Yu Shang, Jie Feng, Yuheng Zhang, Zefang Zong, Yuan Yuan, Hongyuan Su, Nian Li, Jinghua Piao, Yucheng Deng, Nicholas Sukiennik, Chen Gao, Fengli Xu, and Yong Li. Understanding world or predicting future? A comprehensive survey of world models.arXiv preprint arXiv:2411.14499, 2024. URL https://arxiv.org/abs/2411.14499

  11. [11]

    A comprehensive survey on world models for embodied AI.arXiv preprintarXiv:2510.16732, 2025

    Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied AI.arXiv preprint arXiv:2510.16732, 2024. URLhttps://arxiv.org/abs/2510.16732

  12. [12]

    Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models

    Nan Jiang, Alex Kulesza, and Satinder Singh. Hallucinating value: A pitfall of Dyna-style planning with imperfect world models.arXiv preprint arXiv:2006.04363, 2020. URLhttps://arxiv.org/abs/2006.04363

  13. [13]

    Four principles for physically interpretable world models.arXiv preprint arXiv:2503.02143, 2025

    Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. Four principles for physically interpretable world models.arXiv preprint arXiv:2503.02143, 2025. URL https://arxiv.org/abs/ 2503.02143

  14. [14]

    World models: The safety perspective.arXiv preprint arXiv:2411.07690, 2024

    Zifan Zeng, Chongzhe Zhang, Feng Liu, Joseph Sifakis, Qunli Zhang, Shiming Liu, and Peng Wang. World models: The safety perspective.arXiv preprint arXiv:2411.07690, 2024. URL https://arxiv.org/abs/ 2411.07690. 24 Safety, Security, and Cognitive Risks in World Models

  15. [15]

    Deception in reinforced autonomous agents: The unconventional rabbit hat trick in legislation.arXiv preprint arXiv:2405.04325, 2024

    Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, and Balaraman Ravindran. Deception in reinforced autonomous agents: The unconventional rabbit hat trick in legislation.arXiv preprint arXiv:2405.04325, 2024. URLhttps://arxiv.org/abs/2405.04325

  16. [16]

    Dynamic human trust modeling of autonomous agents with varying capability and strategy.arXiv preprint arXiv:2404.19291, 2024

    Jason Dekarske, Zhaodan Kong, and Sanjay Joshi. Dynamic human trust modeling of autonomous agents with varying capability and strategy.arXiv preprint arXiv:2404.19291, 2024. URL https://arxiv.org/abs/ 2404.19291

  17. [17]

    MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems

    MITRE Corporation. MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems. MITRE Corporation, 2024. URLhttps://atlas.mitre.org/

  18. [18]

    OW ASP top 10 for LLM applications

    OW ASP Foundation. OW ASP top 10 for LLM applications. OW ASP Foundation, 2025. URLhttps://owasp. org/www-project-top-10-for-large-language-model-applications/

  19. [19]

    Robust deep reinforcement learning against adversarial perturbations on state observations

    Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URLhttps://arxiv.org/abs/2003.08938

  20. [20]

    Robust deep reinforcement learning with adaptive adversarial perturbations in action space.arXiv preprint arXiv:2405.11982, 2024

    Qianmei Liu, Yufei Kuang, and Jie Wang. Robust deep reinforcement learning with adaptive adversarial perturbations in action space.arXiv preprint arXiv:2405.11982, 2024. URL https://arxiv.org/abs/ 2405.11982

  21. [21]

    When world models dream wrong: Physical-conditioned adversarial attacks against world models.arXiv preprint arXiv:2602.18739, 2026

    Zhixiang Guo, Siyuan Liang, Andras Balogh, Noah Lunberry, Rong-Cheng Tu, Mark Jelasity, and Dacheng Tao. When world models dream wrong: Physical-conditioned adversarial attacks against world models.arXiv preprint arXiv:2602.18739, 2026. URLhttps://arxiv.org/abs/2602.18739

  22. [22]

    Explaining and Harnessing Adversarial Examples

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (ICLR 2015), 2015. URLhttps://arxiv.org/abs/1412.6572

  23. [23]

    Deep learning adversarial attacks and defenses in autonomous vehicles

    Fawzi Boumazouza et al. Deep learning adversarial attacks and defenses in autonomous vehicles. Artificial Intelligence Review, 2024. URL https://link.springer.com/article/10.1007/ s10462-024-11014-8

  24. [24]

    Data poisoning in deep learning: A survey.arXiv preprint arXiv:2503.22759, 2025

    Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, and Ou Wu. Data poisoning in deep learning: A survey.arXiv preprint arXiv:2503.22759, 2025. URLhttps://arxiv.org/abs/2503.22759

  25. [25]

    arXiv preprint arXiv:1906.01820 , year =

    Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems.arXiv preprint arXiv:1906.01820, 2019. URL https: //arxiv.org/abs/1906.01820

  26. [26]

    Sharkey, Jacob Pfau, and David Krueger

    Lauro Langosco di Langosco, Jack Koch, Lee D. Sharkey, Jacob Pfau, and David Krueger. Goal misgeneralization in deep reinforcement learning. InProceedings of the 39th International Conference on Machine Learning (ICML),

  27. [27]

    URLhttps://arxiv.org/abs/2105.14111

  28. [28]

    Specification gaming: The flip side of AI ingenuity

    Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Ku- mar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020. URL https://deepmind.google/discover/blog/ specification-gaming-the-flip-side-of-ai-ingenuity/

  29. [29]

    The alignment problem from a deep learning perspective

    Richard Ngo, Lawrence Chan, and Soren Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022. URLhttps://arxiv.org/abs/2209.00626

  30. [30]

    SafeDreamer: Safe reinforcement learning with world models

    Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. SafeDreamer: Safe reinforcement learning with world models. InInternational Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.07176. 25 Safety, Security, and Cognitive Risks in World Models

  31. [31]

    Mopo: Model-Based Offline Policy Optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URLhttps://arxiv.org/abs/2005.13239

  32. [32]

    MOReL: Model-based offline reinforcement learning

    Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2005.05951

  33. [33]

    COMBO: Conservative offline model-based policy optimization

    Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. URLhttps://arxiv.org/abs/2102.08363

  34. [34]

    Deep reinforcement learning policies learn shared adversarial features across MDPs

    Ezgi Korkmaz. Deep reinforcement learning policies learn shared adversarial features across MDPs. InProceedings of the AAAI Conference on Artificial Intelligence, 2022

  35. [35]

    Adversarial robust deep reinforcement learning requires redefining robustness

    Ezgi Korkmaz. Adversarial robust deep reinforcement learning requires redefining robustness. InProceedings of the AAAI Conference on Artificial Intelligence, 2023

  36. [36]

    Detecting adversarial directions in deep reinforcement learning to make robust decisions

    Ezgi Korkmaz et al. Detecting adversarial directions in deep reinforcement learning to make robust decisions. In Proceedings of the International Conference on Machine Learning, 2023

  37. [37]

    Understanding and diagnosing deep reinforcement learning

    Ezgi Korkmaz. Understanding and diagnosing deep reinforcement learning. InProceedings of the International Conference on Machine Learning, 2024

  38. [38]

    How to lose inherent counterfactuality in reinforcement learning

    Ezgi Korkmaz. How to lose inherent counterfactuality in reinforcement learning. InInternational Conference on Learning Representations, 2026

  39. [39]

    Zico Kolter and Eric Wong

    J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. InProceedings of the International Conference on Machine Learning, 2018

  40. [40]

    Certified robustness to adversarial examples with differential privacy

    Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. InIEEE Symposium on Security and Privacy, 2019

  41. [41]

    Zico Kolter

    Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the International Conference on Machine Learning, 2019

  42. [42]

    Provably robust deep learning via adversarially trained smoothed classifiers

    Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. InAdvances in Neural Information Processing Systems, 2019

  43. [43]

    Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li

    Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. InProceedings of the International Conference on Machine Learning, 2020

  44. [44]

    Towards realistic guarantees: A probabilistic certificate for Smooth- LLM, 2025

    Adarsh Kumarappan and Ayushi Mehrotra. Towards realistic guarantees: A probabilistic certificate for Smooth- LLM, 2025

  45. [45]

    Humans and automation: Use, misuse, disuse, abuse

    Raja Parasuraman and Victor Riley. Humans and automation: Use, misuse, disuse, abuse.Human Factors, 39(2): 230–253, 1997. URLhttps://doi.org/10.1518/001872097778543886

  46. [46]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2:665–673,

  47. [47]

    URLhttps://doi.org/10.1038/s42256-020-00257-z

  48. [48]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. InAdvances in Neural Information Processing Systems (NeurIPS), 2018. URLhttps://arxiv.org/abs/1809.01999. 26 Safety, Security, and Cognitive Risks in World Models

  49. [49]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. URL https://arxiv.org/abs/1912. 01603

  50. [50]

    Sora: Creating video from text

    OpenAI. Sora: Creating video from text. OpenAI Technical Report, 2024. URL https://openai.com/ index/sora/

  51. [51]

    LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions.arXiv preprint arXiv:2509.18970, 2025

    Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Pengfei Cao, Lixin Zou, Xu Chen, Chuan Zhou, Jia Wu, Shirui Pan, Bin Wang, Yanan Cao, Kai Chen, Songlin Hu, and Li Guo. LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions.arXiv preprint arXiv:2509.18970, 202...

  52. [52]

    The foundry problem: World models and the missing liabil- ity framework for self-supervised learning

    Stanford CodeX. The foundry problem: World models and the missing liabil- ity framework for self-supervised learning. Stanford Center for Legal Informat- ics (CodeX) Blog, 2026. URL https://law.stanford.edu/2026/03/06/ the-foundry-problem-world-models-and-the-missing-liability-framework-for-self-supervised-learning/

  53. [53]

    Poisoning attacks against machine learning

    NIST. Poisoning attacks against machine learning. NIST Technical Report, 2022. URL https://tsapps. nist.gov/publication/get_pdf.cfm?pub_id=934932

  54. [54]

    Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

    Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety.arXiv preprint arXiv:2502.05206, 2025. URLhttps://arxiv.org/abs/2502.05206

  55. [55]

    Automation bias in human-AI collaboration: A review.AI & Society, 2025

    Bochao Zou et al. Automation bias in human-AI collaboration: A review.AI & Society, 2025. URL https: //link.springer.com/article/10.1007/s00146-025-02422-7

  56. [56]

    Robust multi-agent reinforcement learning against adversarial attacks for cooperative self- driving vehicles.IET Radar, Sonar & Navigation, 2025

    Guoxin Wang et al. Robust multi-agent reinforcement learning against adversarial attacks for cooperative self- driving vehicles.IET Radar, Sonar & Navigation, 2025. URL https://ietresearch.onlinelibrary. wiley.com/doi/10.1049/rsn2.70033

  57. [57]

    Talvitie, Michael Bowling, and Martha White

    Farzane Aminmansour, Taher Jafferjee, Ehsan Imani, Erin J. Talvitie, Michael Bowling, and Martha White. Mitigating value hallucination in Dyna-style planning via multistep predecessor models.Journal of Artificial In- telligence Research, 80:441–473, 2024. URL https://www.jair.org/index.php/jair/article/ view/15155

  58. [58]

    Sinan Zeng et al. An empirical study on hallucinations in embodied agents. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. URL https://aclanthology.org/2025.findings-emnlp.1158.pdf

  59. [59]

    Katie Kang, Paula Gradu, Jason Choi, Michael Janner, Claire Tomlin, and Sergey Levine. Lyapunov density models: Constraining distribution shift in learning-based control. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022. URL https://arxiv.org/abs/2206.10524

  60. [60]

    Eric Jing and Abdeslam Boularias. Bounding distributional shifts in world modeling through novelty detection. arXiv preprint arXiv:2508.06096, 2025. URL https://arxiv.org/abs/2508.06096

  61. [61]

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025. URL https://arxiv.org/abs/2501.11260

  62. [62]

    Jingkang Wang, Ava Pun, James Tu, et al. AdvSim: Generating safety-critical scenarios for self-driving vehicles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. URL https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_AdvSim_Generating_Safety-Critical_Scenarios_for_Self-Driving_Vehicles_CVPR_2021_paper.pdf

  63. [63]

    Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, and Yushun Dong. A survey on model extraction attacks and defenses for large language models. arXiv preprint arXiv:2506.22521, 2025. URL https://arxiv.org/abs/2506.22521

  64. [64]

    Michael Veale, Reuben Binns, and Lilian Edwards. Algorithms that remember: Model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A, 376(2133):20180083, 2018. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC6191664/

  65. [65]

    Hao Fang, Yixiang Qiu, Hongyao Yu, Wenbo Yu, Jiawei Kong, Baoli Chong, Bin Chen, Xuan Wang, Shu-Tao Xia, and Ke Xu. Privacy leakage on DNNs: A survey of model inversion attacks and defenses. arXiv preprint arXiv:2402.04013, 2024. URL https://arxiv.org/abs/2402.04013

  66. [66]

    Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, et al. Alignment and safety in large language models: Safety mechanisms, training paradigms, and emerging challenges. arXiv preprint arXiv:2507.19672, 2025. URL https://arxiv.org/abs/2507.19672

  67. [67]

    Max Hellrigel-Holderbaum and Leonard Dung. Misalignment or misuse? The AGI alignment tradeoff. arXiv preprint arXiv:2506.03755, 2025. URL https://arxiv.org/abs/2506.03755

  68. [68]

    Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, and Usman Naseem. A survey on progress in LLM alignment from the perspective of reward design. arXiv preprint arXiv:2505.02666, 2025. URL https://arxiv.org/abs/2505.02666

  69. [69]

    Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, et al. Current agents fail to leverage world model as tool for foresight. arXiv preprint arXiv:2601.03905, 2026. URL https://arxiv.org/abs/2601.03905

  70. [70]

    Yahya Aalaila, Gerrit Großmann, Sumantrak Mukherjee, Jonas Wahl, and Sebastian Vollmer. When counterfactual reasoning fails: Chaos and real-world complexity. arXiv preprint arXiv:2503.23820, 2025. URL https://arxiv.org/abs/2503.23820

  71. [71]

    Hayley Clatterbuck, Clinton Castro, and Arvo Muñoz Morán. Risk alignment in agentic AI systems. arXiv preprint arXiv:2410.01927, 2024. URL https://arxiv.org/abs/2410.01927

  72. [72]

    National Institute of Standards and Technology. Artificial intelligence risk management framework (AI RMF 1.0), NIST AI 100-1. Technical report, NIST, 2023. URL https://doi.org/10.6028/NIST.AI.100-1

  73. [73]

    National Institute of Standards and Technology. Artificial intelligence risk management framework: Generative artificial intelligence profile (NIST AI 600-1). Technical report, NIST, 2024. URL https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

  74. [74]

    European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council — Artificial Intelligence Act. Official Journal of the European Union, 2024. URL https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

A Supplementary Figures

    [Figure: x-axis "Rollout step k" (5–30); y-axis values 0.00–0.03; remaining plot content lost in extraction]