Recognition: 3 theorem links
· Lean Theorem
Safety, Security, and Cognitive Risks in World Models
Pith reviewed 2026-05-13 22:31 UTC · model grok-4.3
The pith
World models introduce trajectory persistence and representational risks that let adversaries degrade safety-critical AI agents through data corruption and rollout errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
World models require the same rigour as flight-control software or medical devices because adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments, while also enabling goal misgeneralisation, deceptive alignment, and human miscalibration of trust.
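To make the compounding-rollout-error mechanism concrete, the following is a minimal sketch assuming toy linear latent dynamics, with a slightly mis-estimated transition matrix standing in for a poisoned or corrupted world model; the dimensions, horizon, and noise scale are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear latent dynamics: the environment steps z -> A @ z, while the learned
# world model uses a slightly mis-estimated A_hat (a stand-in for poisoning or
# representational error; purely illustrative).
dim, horizon, eps = 8, 30, 0.02
A = np.eye(dim) + 0.05 * rng.standard_normal((dim, dim))
A_hat = A + eps * rng.standard_normal((dim, dim))

z_true = rng.standard_normal(dim)
z_model = z_true.copy()
drift = []
for k in range(horizon):
    z_true = A @ z_true        # ground-truth environment transition
    z_model = A_hat @ z_model  # open-loop imagined transition, no observation correction
    drift.append(float(np.linalg.norm(z_model - z_true)))

# The gap typically grows with the rollout step k, because each step feeds the
# model's own slightly wrong prediction back into itself.
print([round(d, 3) for d in drift[:5]], "...", round(drift[-1], 3))
```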
What carries the argument
Formal definitions of trajectory persistence and representational risk, a five-profile attacker taxonomy, and a unified threat model drawing on MITRE ATLAS and the OWASP LLM Top 10.
If this is right
- Adversarial fine-tuning on a GRU-based RSSM produces a 2.26-times amplification of attack effects and a 59.5 percent reward reduction (a computation sketch in this style follows this list).
- A stochastic RSSM proxy exhibits lower attack amplification (0.65 times), indicating architecture dependence.
- Real DreamerV3 checkpoints already exhibit non-zero action drift under the same attack patterns.
- Alignment-layer risks such as goal misgeneralisation and reward hacking become more feasible once persistent trajectories are available.
- Human operators face automation bias and planning hallucination when relying on authoritative world-model predictions.
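For concreteness, a minimal sketch of how an amplification ratio and reward reduction of this form could be computed, assuming per-step rollout errors and episode returns have already been collected; the array names and numbers are illustrative placeholders chosen to echo the headline figures, not the paper's data.

```python
import numpy as np

# Hypothetical per-step errors (not the paper's data): wm_err[k] is the mean
# world-model rollout error at step k+1 under the attack, ess_err[k] the
# corresponding one-step baseline error, following the quoted Definition 1.
wm_err = np.array([0.045, 0.052, 0.061, 0.070])
ess_err = np.array([0.020, 0.024, 0.027, 0.031])

# Amplification ratio A_k = E[WM_k] / Ess_k.
amplification = wm_err / ess_err
print("A_k:", np.round(amplification, 2))          # A_1 is about 2.25 on these toy numbers
print("A_1 =", round(float(amplification[0]), 2))

# Reward reduction as a percentage of the clean-run return (toy returns chosen to
# echo the abstract's -59.5% headline figure).
clean_return, attacked_return = 100.0, 40.5
print(f"reward change: {100.0 * (attacked_return - clean_return) / clean_return:+.1f}%")
```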
Where Pith is reading between the lines
- Existing safety benchmarks that test only final actions may miss the compounding effects that arise specifically from persistent world-model errors.
- The same taxonomy could be applied to test whether language-model world models in agentic systems inherit similar persistence properties.
- Governance frameworks such as the EU AI Act may need explicit clauses for latent-state integrity in addition to output safety.
- Interdisciplinary teams combining adversarial ML and control-theory verification could develop quantitative bounds on acceptable rollout error.
Load-bearing premise
The introduced formal definitions of trajectory persistence and representational risk, along with the five-profile attacker taxonomy, accurately and comprehensively capture the distinctive risks in world model-equipped agents.
What would settle it
A controlled experiment on a deployed world-model agent that shows no measurable increase in trajectory drift or reward degradation after systematic attempts to poison the training set and inject rollout perturbations.
Original abstract
World models - learned internal simulators of environment dynamics - are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. By predicting future states in compressed latent spaces, they enable sample-efficient planning and long-horizon imagination without direct environment interaction. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause significant degradation in safety-critical deployments. At the alignment layer, world model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking. At the human layer, authoritative world model predictions foster automation bias, miscalibrated trust, and planning hallucination. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker taxonomy; and develops a unified threat model drawing on MITRE ATLAS and the OWASP LLM Top 10. We provide an empirical proof-of-concept demonstrating trajectory-persistent adversarial attacks on a GRU-based RSSM ($\mathcal{A}_1 = 2.26\times$ amplification, $-59.5\%$ reward reduction under adversarial fine-tuning), validate architecture-dependence via a stochastic RSSM proxy ($\mathcal{A}_1 = 0.65\times$), and probe a real DreamerV3 checkpoint (non-zero action drift confirmed). We propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design, arguing that world models require the same rigour as flight-control software or medical devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys the landscape of world models in autonomous agents, introduces formal definitions of trajectory persistence and representational risk, presents a five-profile attacker taxonomy, and develops a unified threat model integrating MITRE ATLAS and OWASP LLM Top 10. It reports an empirical proof-of-concept on GRU-based RSSM showing 2.26x amplification and -59.5% reward reduction under adversarial fine-tuning, with validation on a stochastic RSSM proxy (0.65x) and a DreamerV3 checkpoint confirming non-zero action drift. The central argument is that these models create distinctive safety, security, and cognitive risks requiring rigour comparable to flight-control software or medical devices, with proposed mitigations spanning adversarial hardening, alignment, and governance.
Significance. If the formal definitions and taxonomy are shown to isolate vulnerabilities beyond standard RL error metrics and existing adversarial frameworks, the work would establish a necessary foundation for treating world models as high-stakes components in robotics and agentic systems. The empirical POC provides concrete evidence of degradation under poisoning and rollout attacks, supporting calls for interdisciplinary safeguards aligned with NIST AI RMF and EU AI Act.
major comments (3)
- [formal definitions section] Section introducing formal definitions of trajectory persistence and representational risk: the manuscript does not include an ablation demonstrating that these constructs predict degradation better than baseline rollout variance or compounding error norms; without this, the claim that they capture distinctive risks (beyond reframing known poisoning/rollout attacks) remains unestablished and load-bearing for the central argument (a side-by-side computation sketch follows this list).
- [empirical POC section] Empirical proof-of-concept section reporting GRU-RSSM results (A1 = 2.26x amplification, -59.5% reward reduction): the abstract and results lack methodology details including data exclusion rules, error bars, statistical tests, and verification procedures for the GRU-based RSSM and DreamerV3 experiments, preventing assessment of whether the observed effects exceed standard adversarial RL baselines.
- [attacker taxonomy and threat model section] Section on the five-profile attacker taxonomy and unified threat model: the taxonomy is not shown to exhaust or extend vectors already covered by MITRE ATLAS; an explicit mapping or gap analysis is required to substantiate that the framework adds distinctive coverage rather than overlapping with existing adversarial RL techniques.
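To make the requested contrast concrete, here is a minimal sketch that computes the referee's suggested baseline (a plain compounding-error norm) and the paper's quoted amplification construct side by side from the same rollout data; the error arrays are illustrative placeholders, not measurements from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-episode, per-step rollout errors under attack (episodes x horizon)
# and matching one-step baseline errors; placeholders, not the paper's data.
wm_err = np.abs(rng.normal(0.05, 0.01, size=(16, 10))) * np.linspace(1.0, 2.0, 10)
ess_err = np.abs(rng.normal(0.02, 0.005, size=(16, 10)))

# Referee's baseline: a plain compounding-error norm, i.e. how the rollout error
# grows with step k, with no reference to the one-step error.
compounding_norm = wm_err.mean(axis=0)                  # E[WM_k] per step k

# Paper's construct as quoted in Definition 1: amplification relative to the
# one-step "essential" error, A_k = E[WM_k] / Ess_k.
amplification = compounding_norm / ess_err.mean(axis=0)

print("E[WM_k]:", np.round(compounding_norm, 3))
print("A_k:    ", np.round(amplification, 2))
# The ablation the referee requests would test which of these two signals better
# predicts downstream reward degradation under the studied attacks.
```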
minor comments (2)
- [abstract] The abstract reports precise numerical results (e.g., 2.26x, -59.5%) without accompanying confidence intervals or baseline comparisons, which reduces clarity for readers assessing the magnitude of effects.
- [empirical results] Notation for A1 amplification and reward reduction should be defined explicitly on first use with reference to the underlying equations for trajectory persistence.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the work.
Point-by-point responses
-
Referee: [formal definitions section] Section introducing formal definitions of trajectory persistence and representational risk: the manuscript does not include an ablation demonstrating that these constructs predict degradation better than baseline rollout variance or compounding error norms; without this, the claim that they capture distinctive risks (beyond reframing known poisoning/rollout attacks) remains unestablished and load-bearing for the central argument.
Authors: We agree that an ablation would provide stronger empirical grounding for the claim that trajectory persistence and representational risk isolate vulnerabilities beyond standard RL metrics. The manuscript is structured as a survey with targeted proof-of-concept demonstrations rather than a full comparative empirical study. In the revision we will add a dedicated discussion subsection that (i) formally contrasts the new constructs with rollout variance and compounding error norms and (ii) reports a limited post-hoc comparison using the existing GRU-RSSM and DreamerV3 data to illustrate where the new metrics diverge. A comprehensive ablation study is beyond the current scope but will be noted as future work. revision: partial
-
Referee: [empirical POC section] Empirical proof-of-concept section reporting GRU-RSSM results (A1 = 2.26x amplification, -59.5% reward reduction): the abstract and results lack methodology details including data exclusion rules, error bars, statistical tests, and verification procedures for the GRU-based RSSM and DreamerV3 experiments, preventing assessment of whether the observed effects exceed standard adversarial RL baselines.
Authors: We accept that the current presentation of the empirical results is insufficiently detailed for reproducibility and baseline comparison. The revised manuscript will expand the empirical section to include: data exclusion criteria, error bars with standard deviations across repeated runs, statistical significance tests, and explicit verification procedures for both the GRU-based RSSM and DreamerV3 checkpoint experiments. These additions will allow readers to evaluate the results against standard adversarial RL baselines (a minimal sketch of this style of reporting follows these responses). revision: yes
-
Referee: [attacker taxonomy and threat model section] Section on the five-profile attacker taxonomy and unified threat model: the taxonomy is not shown to exhaust or extend vectors already covered by MITRE ATLAS; an explicit mapping or gap analysis is required to substantiate that the framework adds distinctive coverage rather than overlapping with existing adversarial RL techniques.
Authors: We agree that an explicit mapping is required to substantiate the added value of the taxonomy. The revised manuscript will include a new table that maps each of the five attacker profiles to the relevant MITRE ATLAS techniques, together with a gap analysis. The analysis will highlight extensions in the areas of cognitive risk, goal misgeneralization, and human-factor biases that are not the primary focus of existing adversarial RL frameworks, thereby clarifying how the unified threat model integrates and extends prior work. revision: yes
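For illustration, a minimal sketch of the style of reporting promised in the second response, assuming per-seed mean returns for clean and attacked runs are available; the numbers and the paired sign-flip permutation test are placeholders standing in for whatever procedure the revision adopts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-seed mean returns (not the paper's data).
clean = np.array([101.2, 98.7, 103.5, 99.8, 100.4, 102.1, 97.9, 100.9])
attacked = np.array([42.1, 38.9, 44.3, 40.0, 37.2, 43.5, 39.4, 41.8])

def summarize(x: np.ndarray) -> str:
    return f"{x.mean():.1f} +/- {x.std(ddof=1):.1f} (n={x.size})"

print("clean:   ", summarize(clean))
print("attacked:", summarize(attacked))

# Paired sign-flip permutation test: under the null of no attack effect, the sign
# of each per-seed difference is exchangeable, so the observed mean difference is
# compared against a sign-randomised null distribution.
diff = attacked - clean
observed = diff.mean()
signs = rng.choice([-1.0, 1.0], size=(10_000, diff.size))
null = (signs * diff).mean(axis=1)
p_value = float(np.mean(np.abs(null) >= abs(observed)))
print(f"mean reward change {observed:.1f}, two-sided p ~= {p_value:.4f}")
```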
Circularity Check
New formal definitions and taxonomy introduced as independent constructs; no reduction to self-referential inputs
Full rationale
The paper introduces formal definitions of trajectory persistence and representational risk plus a five-profile attacker taxonomy as novel constructs, then applies them to survey risks and report an empirical POC on the GRU-RSSM and DreamerV3. These definitions are presented as new rather than derived from prior fitted parameters or self-citations that reduce to the target claims. The unified threat model explicitly draws on external sources (MITRE ATLAS, OWASP LLM Top 10) and NIST/EU frameworks. No equations, fitted inputs, or self-citation chains are shown to force the central claim by construction; the empirical results (A1 amplification, reward reduction) stand as separate validation. This yields at most a minor self-citation load (score 2) while remaining anchored to external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
Paper passage: Definition 1 (Trajectory Persistence). ... $\mathcal{A}_k = \mathbb{E}[\mathrm{WM}_k] / \mathrm{Ess}_k$ ... trajectory-persistent if $\mathcal{A}_1 \gg 1$ or $\mathcal{A}_k > 1$ for multiple $k > 1$ (both definition fragments are set in display form after this list).
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem.
Paper passage: Definition 2 (Representational Risk). $R(\theta, D) = \mathbb{E}[D_{\mathrm{TV}}(P^{*}, P_{\theta})]$ ... Foundry Problem.
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem.
Paper passage: five-profile attacker taxonomy (White-box, Grey-box, Black-box, Insider, Supply-chain).
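For readability, the two definition fragments quoted above can be set in display form. This is a best-effort reconstruction from the excerpt; the precise meanings of $\mathrm{WM}_k$ and $\mathrm{Ess}_k$ (world-model and baseline error terms) are not visible here and are treated as assumptions.

```latex
% Definition 1 (Trajectory Persistence), reconstructed from the quoted fragment.
\[
  \mathcal{A}_k \;=\; \frac{\mathbb{E}\!\left[\mathrm{WM}_k\right]}{\mathrm{Ess}_k},
  \qquad
  \text{trajectory-persistent if } \mathcal{A}_1 \gg 1
  \ \text{or}\ \mathcal{A}_k > 1 \ \text{for multiple } k > 1 .
\]

% Definition 2 (Representational Risk), reconstructed from the quoted fragment.
\[
  R(\theta, D) \;=\; \mathbb{E}\!\left[\, D_{\mathrm{TV}}\!\left(P^{*},\, P_{\theta}\right) \right].
\]
```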
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018. URL https://arxiv.org/abs/1803.10122
- [2] Danijar Hafner, Jurgis Pašukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 640:647–653, 2025. URL https://www.nature.com/articles/s41586-025-08744-2
- [3] Yann LeCun. A path towards autonomous machine intelligence. OpenReview, 2022. URL https://openreview.net/pdf?id=BZ5a1r-kVsf
- [4] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, et al. Model-based imitation learning for urban driving. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2210.07729
- [5] Anthony Hu, Lloyd Russell, Hudson Yeo, et al. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. URL https://arxiv.org/abs/2309.17080
- [6] Xiaofeng Wang, Zheng Zhu, Guan Huang, et al. DriveDreamer: Towards real-world-driven world models for autonomous driving. In European Conference on Computer Vision (ECCV), 2024. URL https://arxiv.org/abs/2309.09777
- [7] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, et al. Learning interactive real-world simulators. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2310.06114
- [8] Chenhao Li, Andreas Krause, and Marco Hutter. Robotic world model: A neural network simulator for robust policy optimization in robotics. arXiv preprint arXiv:2501.10100, 2025. URL https://arxiv.org/abs/2501.10100
- [9] Jake Bruce, Michael Dennis, Ashley Edwards, et al. Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391, 2024. URL https://arxiv.org/abs/2402.15391
- [10] Jingtao Ding, Yunke Zhang, Yu Shang, Jie Feng, Yuheng Zhang, Zefang Zong, Yuan Yuan, Hongyuan Su, Nian Li, Jinghua Piao, Yucheng Deng, Nicholas Sukiennik, Chen Gao, Fengli Xu, and Yong Li. Understanding world or predicting future? A comprehensive survey of world models. arXiv preprint arXiv:2411.14499, 2024. URL https://arxiv.org/abs/2411.14499
- [11] Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied AI. arXiv preprint arXiv:2510.16732, 2025. URL https://arxiv.org/abs/2510.16732
- [12] Nan Jiang, Alex Kulesza, and Satinder Singh. Hallucinating value: A pitfall of Dyna-style planning with imperfect world models. arXiv preprint arXiv:2006.04363, 2020. URL https://arxiv.org/abs/2006.04363
- [13] Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. Four principles for physically interpretable world models. arXiv preprint arXiv:2503.02143, 2025. URL https://arxiv.org/abs/2503.02143
- [14] Zifan Zeng, Chongzhe Zhang, Feng Liu, Joseph Sifakis, Qunli Zhang, Shiming Liu, and Peng Wang. World models: The safety perspective. arXiv preprint arXiv:2411.07690, 2024. URL https://arxiv.org/abs/2411.07690
- [15] Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, and Balaraman Ravindran. Deception in reinforced autonomous agents: The unconventional rabbit hat trick in legislation. arXiv preprint arXiv:2405.04325, 2024. URL https://arxiv.org/abs/2405.04325
- [16] Jason Dekarske, Zhaodan Kong, and Sanjay Joshi. Dynamic human trust modeling of autonomous agents with varying capability and strategy. arXiv preprint arXiv:2404.19291, 2024. URL https://arxiv.org/abs/2404.19291
- [17] MITRE Corporation. MITRE ATLAS: Adversarial threat landscape for artificial-intelligence systems. MITRE Corporation, 2024. URL https://atlas.mitre.org/
- [18] OWASP Foundation. OWASP top 10 for LLM applications. OWASP Foundation, 2025. URL https://owasp.org/www-project-top-10-for-large-language-model-applications/
- [19] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2003.08938
- [20] Qianmei Liu, Yufei Kuang, and Jie Wang. Robust deep reinforcement learning with adaptive adversarial perturbations in action space. arXiv preprint arXiv:2405.11982, 2024. URL https://arxiv.org/abs/2405.11982
- [21] Zhixiang Guo, Siyuan Liang, Andras Balogh, Noah Lunberry, Rong-Cheng Tu, Mark Jelasity, and Dacheng Tao. When world models dream wrong: Physical-conditioned adversarial attacks against world models. arXiv preprint arXiv:2602.18739, 2026. URL https://arxiv.org/abs/2602.18739
- [22] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (ICLR 2015), 2015. URL https://arxiv.org/abs/1412.6572
- [23] Fawzi Boumazouza et al. Deep learning adversarial attacks and defenses in autonomous vehicles. Artificial Intelligence Review, 2024. URL https://link.springer.com/article/10.1007/s10462-024-11014-8
- [24] Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, and Ou Wu. Data poisoning in deep learning: A survey. arXiv preprint arXiv:2503.22759, 2025. URL https://arxiv.org/abs/2503.22759
- [25] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019. URL https://arxiv.org/abs/1906.01820
- [26] Lauro Langosco di Langosco, Jack Koch, Lee D. Sharkey, Jacob Pfau, and David Krueger. Goal misgeneralization in deep reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022
- [27]
- [28] Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020. URL https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
- [29] Richard Ngo, Lawrence Chan, and Soren Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022. URL https://arxiv.org/abs/2209.00626
- [30] Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. SafeDreamer: Safe reinforcement learning with world models. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.07176
- [31] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2005.13239
- [32] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2005.05951
- [33] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2021. URL https://arxiv.org/abs/2102.08363
- [34] Ezgi Korkmaz. Deep reinforcement learning policies learn shared adversarial features across MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022
- [35] Ezgi Korkmaz. Adversarial robust deep reinforcement learning requires redefining robustness. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023
- [36] Ezgi Korkmaz et al. Detecting adversarial directions in deep reinforcement learning to make robust decisions. In Proceedings of the International Conference on Machine Learning, 2023
- [37] Ezgi Korkmaz. Understanding and diagnosing deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2024
- [38] Ezgi Korkmaz. How to lose inherent counterfactuality in reinforcement learning. In International Conference on Learning Representations, 2026
- [39] J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the International Conference on Machine Learning, 2018
- [40] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy, 2019
- [41] Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the International Conference on Machine Learning, 2019
- [42] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems, 2019
- [43] Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In Proceedings of the International Conference on Machine Learning, 2020
- [44] Adarsh Kumarappan and Ayushi Mehrotra. Towards realistic guarantees: A probabilistic certificate for SmoothLLM, 2025
- [45] Raja Parasuraman and Victor Riley. Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39(2):230–253, 1997. URL https://doi.org/10.1518/001872097778543886
- [46]
- [47] URL https://doi.org/10.1038/s42256-020-00257-z
- [48] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS), 2018. URL https://arxiv.org/abs/1809.01999
- [49] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019. URL https://arxiv.org/abs/1912.01603
- [50] OpenAI. Sora: Creating video from text. OpenAI Technical Report, 2024. URL https://openai.com/index/sora/
- [51] Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Pengfei Cao, Lixin Zou, Xu Chen, Chuan Zhou, Jia Wu, Shirui Pan, Bin Wang, Yanan Cao, Kai Chen, Songlin Hu, and Li Guo. LLM-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions. arXiv preprint arXiv:2509.18970, 2025
- [52] Stanford CodeX. The foundry problem: World models and the missing liability framework for self-supervised learning. Stanford Center for Legal Informatics (CodeX) Blog, 2026. URL https://law.stanford.edu/2026/03/06/the-foundry-problem-world-models-and-the-missing-liability-framework-for-self-supervised-learning/
- [53] NIST. Poisoning attacks against machine learning. NIST Technical Report, 2022. URL https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=934932
- [54] Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety. arXiv preprint arXiv:2502.05206, 2025. URL https://arxiv.org/abs/2502.05206
- [55] Bochao Zou et al. Automation bias in human-AI collaboration: A review. AI & Society, 2025. URL https://link.springer.com/article/10.1007/s00146-025-02422-7
- [56] Guoxin Wang et al. Robust multi-agent reinforcement learning against adversarial attacks for cooperative self-driving vehicles. IET Radar, Sonar & Navigation, 2025. URL https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/rsn2.70033
- [57] Farzane Aminmansour, Taher Jafferjee, Ehsan Imani, Erin J. Talvitie, Michael Bowling, and Martha White. Mitigating value hallucination in Dyna-style planning via multistep predecessor models. Journal of Artificial Intelligence Research, 80:441–473, 2024. URL https://www.jair.org/index.php/jair/article/view/15155
- [58] Sinan Zeng et al. An empirical study on hallucinations in embodied agents. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. URL https://aclanthology.org/2025.findings-emnlp.1158.pdf
- [59] Katie Kang, Paula Gradu, Jason Choi, Michael Janner, Claire Tomlin, and Sergey Levine. Lyapunov density models: Constraining distribution shift in learning-based control. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022. URL https://arxiv.org/abs/2206.10524
- [60] Eric Jing and Abdeslam Boularias. Bounding distributional shifts in world modeling through novelty detection. arXiv preprint arXiv:2508.06096, 2025. URL https://arxiv.org/abs/2508.06096
- [61] Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025. URL https://arxiv.org/abs/2501.11260
- [62] Jingkang Wang, Ava Pun, James Tu, et al. AdvSim: Generating safety-critical scenarios for self-driving vehicles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. URL https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_AdvSim_Generating_Safety-Critical_Scenarios_for_Self-Driving_Vehicles_CVPR_2021_paper.pdf
- [63] Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, and Yushun Dong. A survey on model extraction attacks and defenses for large language models. arXiv preprint arXiv:2506.22521, 2025. URL https://arxiv.org/abs/2506.22521
- [64] Michael Veale, Reuben Binns, and Lilian Edwards. Algorithms that remember: Model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A, 376(2133):20180083, 2018. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC6191664/
- [65] Hao Fang, Yixiang Qiu, Hongyao Yu, Wenbo Yu, Jiawei Kong, Baoli Chong, Bin Chen, Xuan Wang, Shu-Tao Xia, and Ke Xu. Privacy leakage on DNNs: A survey of model inversion attacks and defenses. arXiv preprint arXiv:2402.04013, 2024. URL https://arxiv.org/abs/2402.04013
- [66] Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, et al. Alignment and safety in large language models: Safety mechanisms, training paradigms, and emerging challenges. arXiv preprint arXiv:2507.19672, 2025. URL https://arxiv.org/abs/2507.19672
- [67] Max Hellrigel-Holderbaum and Leonard Dung. Misalignment or misuse? The AGI alignment tradeoff. arXiv preprint arXiv:2506.03755, 2025. URL https://arxiv.org/abs/2506.03755
- [68] Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, and Usman Naseem. A survey on progress in LLM alignment from the perspective of reward design. arXiv preprint arXiv:2505.02666, 2025. URL https://arxiv.org/abs/2505.02666
- [69] Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, et al. Current agents fail to leverage world model as tool for foresight. arXiv preprint arXiv:2601.03905, 2026. URL https://arxiv.org/abs/2601.03905
- [70] Yahya Aalaila, Gerrit Großmann, Sumantrak Mukherjee, Jonas Wahl, and Sebastian Vollmer. When counterfactual reasoning fails: Chaos and real-world complexity. arXiv preprint arXiv:2503.23820, 2025. URL https://arxiv.org/abs/2503.23820
- [71] Hayley Clatterbuck, Clinton Castro, and Arvo Muñoz Morán. Risk alignment in agentic AI systems. arXiv preprint arXiv:2410.01927, 2024. URL https://arxiv.org/abs/2410.01927
- [72] National Institute of Standards and Technology. Artificial intelligence risk management framework (AI RMF 1.0), NIST AI 100-1. Technical report, NIST, 2023. URL https://doi.org/10.6028/NIST.AI.100-1
- [73] National Institute of Standards and Technology. Artificial intelligence risk management framework: Generative artificial intelligence profile (NIST AI 600-1). Technical report, NIST, 2024. URL https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
- [74] European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council (Artificial Intelligence Act). Official Journal of the European Union, 2024. URL https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng