Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
Pith reviewed 2026-06-30 19:16 UTC · model grok-4.3
The pith
Agentic AI systems require stage-specific safeguards for safety, robustness, privacy, and security to handle multi-step failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic AI introduces new failure modes through multi-step trajectories. The survey addresses trustworthiness by mapping risks to workflow stages along two dimensions, Safety and Robustness, and Privacy and System Security. It also consolidates evaluation resources into a single metrics-and-benchmarks hub that includes scenario-to-metric guidance for release decisions.
What carries the argument
The unified metrics-and-benchmarks hub that consolidates outcome and process signals (such as constraint violations, trace completeness, and adversarial success rates) with scenario-to-metric guidance for evaluation.
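The three signals named above can be made concrete with a small sketch. The abstract lists the signals but does not define them, so the definitions below are assumptions chosen for illustration: the `Step` and `Episode` records, their field names, and the per-step versus per-episode denominators are all hypothetical and are not taken from the survey.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One action in an agent trajectory (hypothetical record layout)."""
    action: str
    logged: bool = True                # step was fully recorded in the trace
    violated_constraint: bool = False  # step broke a declared constraint


@dataclass
class Episode:
    """One multi-step run of the agent (hypothetical record layout)."""
    steps: List[Step] = field(default_factory=list)
    adversarial: bool = False       # run was executed under an attack scenario
    attack_succeeded: bool = False  # the attack reached its goal


def _all_steps(episodes: List[Episode]) -> List[Step]:
    return [s for e in episodes for s in e.steps]


def constraint_violation_rate(episodes: List[Episode]) -> float:
    """Process signal: fraction of steps that violated a constraint."""
    steps = _all_steps(episodes)
    return sum(s.violated_constraint for s in steps) / len(steps) if steps else 0.0


def trace_completeness(episodes: List[Episode]) -> float:
    """Process signal: fraction of steps that were fully logged."""
    steps = _all_steps(episodes)
    return sum(s.logged for s in steps) / len(steps) if steps else 1.0


def adversarial_success_rate(episodes: List[Episode]) -> float:
    """Outcome signal: fraction of adversarial episodes where the attack won."""
    adv = [e for e in episodes if e.adversarial]
    return sum(e.attack_succeeded for e in adv) / len(adv) if adv else 0.0
```

Under these assumed definitions, the first two are process signals computed over steps, while the third is an outcome signal computed over adversarial episodes only. A real hub would need to state its denominators just as explicitly for results to be comparable across systems.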
If this is right
- Workflow-stage risk mapping enables targeted interventions that reduce specific failure modes such as constraint violations.
- The metrics-and-benchmarks hub supports consistent comparison across systems using both outcome and process measures.
- Scenario-to-metric guidance improves release gating by linking evaluation signals directly to deployment contexts.
- Attention to listed open challenges like runtime monitoring will be required to maintain trustworthiness as agents evolve.
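The idea of scenario-to-metric guidance for release gating, mentioned in the list above, can be sketched as a lookup from deployment scenario to metric bounds. Everything in this sketch is an assumption made for illustration: the scenario names, the threshold values, and the `release_gate` function are hypothetical and do not come from the survey.

```python
from typing import Dict, List, Tuple

# Hypothetical scenario-to-metric table. Each entry maps a metric name to
# ("max", bound) or ("min", bound). The numbers are illustrative only.
GATES: Dict[str, Dict[str, Tuple[str, float]]] = {
    "internal_sandbox": {
        "constraint_violation_rate": ("max", 0.10),
        "trace_completeness": ("min", 0.80),
    },
    "high_stakes_production": {
        "constraint_violation_rate": ("max", 0.01),
        "trace_completeness": ("min", 0.99),
        "adversarial_success_rate": ("max", 0.05),
    },
}


def release_gate(scenario: str, metrics: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (passed, failure messages) for one scenario.

    A metric that the scenario requires but that was not measured counts as
    a failure, so a missing evaluation cannot pass silently.
    """
    failures: List[str] = []
    for name, (kind, bound) in GATES[scenario].items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "max" and value > bound:
            failures.append(f"{name}: {value} exceeds maximum {bound}")
        elif kind == "min" and value < bound:
            failures.append(f"{name}: {value} below minimum {bound}")
    return (not failures, failures)
```

The same measured metrics can pass one scenario and fail another, which is the point of tying evaluation signals to deployment context rather than to a single global threshold.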
Where Pith is reading between the lines
- The survey's workflow focus could be extended to test whether adding verification checkpoints at each stage measurably lowers adversarial success rates in controlled experiments.
- The case study of open-source failures implies that public disclosure of attack traces might accelerate community development of the metrics hub.
- Treating the hub as a living resource would require periodic updates tied to new agent architectures to keep the scenario guidance current.
Load-bearing premise
The key concepts, workflow-stage risks, and stage-targeted mitigation strategies drawn from existing literature are comprehensive and representative enough to guide high-stakes deployments.
What would settle it
A documented failure in a deployed agentic system whose root cause lies outside the surveyed workflow stages, risks, or mitigations and is not captured by the proposed metrics hub.
Original abstract
Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys trustworthy agentic AI along two dimensions: Safety and Robustness, and Privacy and System Security. For each, it clarifies key concepts, maps risks to stages of the agent workflow (planning, tool use, memory, long-horizon interaction), and summarizes stage-targeted mitigations. It consolidates evaluation into a unified metrics-and-benchmarks hub with scenario-to-metric guidance, discusses other trustworthiness aspects as context, outlines open challenges (self-evolving agents, runtime monitoring, privacy-preserving personalization, trust-utility trade-off), and presents a case study of real-world security failures in open-source systems.
Significance. A well-documented survey that consolidates workflow-stage risks, mitigations, and a metrics hub could serve as a practical reference for high-stakes deployments by enabling consistent comparison and release gating decisions; the inclusion of outcome and process signals (constraint violations, trace completeness, adversarial success rates) and the case study add concrete utility if coverage is representative.
major comments (1)
- [Abstract] Abstract (and implied introduction): the central claim that the survey consolidates 'key concepts, workflow-stage risks, and stage-targeted mitigation strategies' into a 'practical reference' for high-stakes environments rests on the assumption of representative coverage, yet no search methodology, database list, date range, inclusion/exclusion criteria, or PRISMA-style flow is described; this directly prevents verification that important works (e.g., on memory-augmented failure modes or tool-calling privacy leakage) were not omitted.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback highlighting the need for explicit documentation of our literature search process. We agree this is a valid point for a survey claiming representative coverage and will revise accordingly to strengthen verifiability.
Point-by-point responses
Referee: [Abstract] Abstract (and implied introduction): the central claim that the survey consolidates 'key concepts, workflow-stage risks, and stage-targeted mitigation strategies' into a 'practical reference' for high-stakes environments rests on the assumption of representative coverage, yet no search methodology, database list, date range, inclusion/exclusion criteria, or PRISMA-style flow is described; this directly prevents verification that important works (e.g., on memory-augmented failure modes or tool-calling privacy leakage) were not omitted.
Authors: We acknowledge that the absence of a documented search methodology limits the ability to independently verify coverage and that this is a substantive limitation for a survey positioned as a practical reference. In the revised version, we will insert a new subsection (likely in Section 1 or a dedicated 'Survey Methodology' paragraph) that explicitly describes: (1) the primary databases and repositories searched (arXiv, Google Scholar, ACL Anthology, IEEE Xplore); (2) the keyword combinations and Boolean queries used (e.g., 'agentic AI' AND ('safety' OR 'robustness' OR 'privacy' OR 'tool use' OR 'memory')); (3) the time window (primarily 2022–early 2025, with selected foundational works); (4) inclusion criteria centered on works addressing multi-step agent workflows; and (5) a high-level PRISMA-style flow summarizing screening steps. We will also note that the survey is intentionally focused rather than exhaustive. With respect to the specific examples raised, memory-augmented failure modes are addressed in the memory-stage risk subsection and tool-calling privacy leakage appears in the tool-use privacy discussion; the added methodology section will make it easier for readers to assess whether additional references should be incorporated.
Revision: yes
Circularity Check
No circularity: descriptive survey with no derivations or self-referential predictions
This is a literature survey paper that organizes and summarizes external research on safety, robustness, privacy, and security in agentic AI systems. It identifies concepts, risks, mitigations, and metrics from prior work without any mathematical derivations, equations, fitted parameters, or claims that a result is predicted from first principles. No load-bearing steps reduce to self-definition, self-citation chains, or renaming of known results; the paper explicitly positions itself as a consolidation of external literature rather than an original derivation. The central claim of providing a practical reference rests on the breadth of cited works, not on any internal reduction to the paper's own inputs.
Forward citations
Cited by 1 Pith paper
- Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents. A data-centric survey finds that only information-flow control covers compositional and cross-session leakage in LLM agents and that no single benchmark tests an agent across all its data surfaces under one policy.