arxiv: 2605.23989 · v1 · pith:73VPCPPDnew · submitted 2026-05-17 · 💻 cs.AI · cs.CL · cs.CR

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Pith reviewed 2026-06-30 19:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CR
keywords agentic AI · trustworthy AI · safety · robustness · privacy · system security · LLM agents · benchmarks

The pith

Agentic AI systems require stage-specific safeguards for safety, robustness, privacy, and security to handle multi-step failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines trustworthy agentic AI, defined as large language models augmented with planning, tool use, memory, and long-horizon interactions that execute tasks autonomously. It focuses on two dimensions critical for high-risk uses—Safety and Robustness, and Privacy and System Security—by defining key concepts, locating risks along the agent workflow, and summarizing targeted mitigation strategies from the literature. A unified metrics-and-benchmarks hub is assembled to enable consistent evaluation using both outcome and process signals, along with scenario-to-metric guidance for deployment decisions. Other trustworthiness aspects receive only contextual mention. The work concludes with open challenges and a case study of security failures in open-source systems to serve as a practical reference.

Core claim

Agentic AI introduces new failure modes through multi-step trajectories, and the survey addresses trustworthiness by mapping risks to workflow stages within Safety and Robustness as well as Privacy and System Security, while consolidating evaluation resources into a single metrics-and-benchmarks hub that includes scenario guidance for release decisions.

What carries the argument

The unified metrics-and-benchmarks hub that consolidates outcome and process signals (such as constraint violations, trace completeness, and adversarial success rates) with scenario-to-metric guidance for evaluation.
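
To make the three named signals concrete, here is a minimal sketch of how they might be computed from logged agent episodes. The summary above does not give the survey's formal definitions, so the `Step` and `Episode` records, their field names, and the per-step and per-episode denominators are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a hypothetical agent trace (illustrative schema)."""
    stage: str                                   # e.g. "planning", "tool_use", "memory"
    logged: bool = True                          # step left an auditable record
    violations: list[str] = field(default_factory=list)  # ids of constraints broken


@dataclass
class Episode:
    """One task attempt; field names are assumptions, not the survey's."""
    steps: list[Step]
    task_succeeded: bool                         # outcome signal
    adversarial: bool = False                    # an attack was injected
    attack_succeeded: bool = False               # the attacker's goal was reached


def _all_steps(episodes):
    return [s for e in episodes for s in e.steps]


def constraint_violation_rate(episodes):
    """Process signal: share of steps that broke at least one constraint."""
    steps = _all_steps(episodes)
    return sum(bool(s.violations) for s in steps) / len(steps) if steps else 0.0


def trace_completeness(episodes):
    """Process signal: share of steps that left an auditable record."""
    steps = _all_steps(episodes)
    return sum(s.logged for s in steps) / len(steps) if steps else 0.0


def adversarial_success_rate(episodes):
    """Outcome signal: share of adversarial episodes where the attack worked."""
    adv = [e for e in episodes if e.adversarial]
    return sum(e.attack_succeeded for e in adv) / len(adv) if adv else 0.0
```

Counting violations per step rather than per episode is one possible choice; a per-episode rate would weight long trajectories less heavily.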

If this is right

  • Workflow-stage risk mapping enables targeted interventions that reduce specific failure modes such as constraint violations.
  • The metrics-and-benchmarks hub supports consistent comparison across systems using both outcome and process measures.
  • Scenario-to-metric guidance improves release gating by linking evaluation signals directly to deployment contexts.
  • Attention to listed open challenges like runtime monitoring will be required to maintain trustworthiness as agents evolve.
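
The release-gating idea in the list above can be sketched as a lookup from deployment scenario to the metrics it gates on. The scenario names, metric names, and thresholds below are hypothetical placeholders, not values taken from the survey.

```python
# Hypothetical scenario-to-metric table. Each scenario lists the metrics it
# gates on as (direction, threshold): "max" means the measured value must not
# exceed the threshold, "min" means it must not fall below it.
GATES = {
    "financial_tool_use": {
        "constraint_violation_rate": ("max", 0.01),
        "adversarial_success_rate": ("max", 0.05),
        "trace_completeness": ("min", 0.99),
    },
    "internal_research_assistant": {
        "adversarial_success_rate": ("max", 0.20),
        "task_success_rate": ("min", 0.70),
    },
}


def release_decision(scenario, measured):
    """Return (ok, failures) for one scenario given measured metric values.

    A metric that the scenario requires but that was never measured counts
    as a failure, so a missing evaluation cannot pass the gate silently.
    """
    failures = []
    for metric, (direction, threshold) in GATES[scenario].items():
        value = measured.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
        elif direction == "max" and value > threshold:
            failures.append(f"{metric}: {value} > {threshold}")
        elif direction == "min" and value < threshold:
            failures.append(f"{metric}: {value} < {threshold}")
    return (not failures, failures)
```

Returning the list of failed checks alongside the boolean keeps the gate auditable: a blocked release states which signal blocked it.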

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The survey's workflow focus could be extended to test whether adding verification checkpoints at each stage measurably lowers adversarial success rates in controlled experiments.
  • The case study of open-source failures implies that public disclosure of attack traces might accelerate community development of the metrics hub.
  • Treating the hub as a living resource would require periodic updates tied to new agent architectures to keep the scenario guidance current.

Load-bearing premise

The key concepts, workflow-stage risks, and stage-targeted mitigation strategies drawn from existing literature are comprehensive and representative enough to guide high-stakes deployments.

What would settle it

A documented failure in a deployed agentic system whose root cause lies outside the surveyed workflow stages, risks, or mitigations and is not captured by the proposed metrics hub.

Original abstract

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper surveys trustworthy agentic AI along two dimensions: Safety and Robustness, and Privacy and System Security. It clarifies key concepts, maps risks to stages of the agent workflow (planning, tool use, memory, long-horizon interaction), summarizes stage-targeted mitigations, and consolidates evaluation into a unified metrics-and-benchmarks hub with scenario-to-metric guidance. Other trustworthiness aspects are discussed as context. The paper closes with open challenges (self-evolving agents, runtime monitoring, privacy-preserving personalization, trust-utility trade-off) and a case study of real-world security failures in open-source systems.

Significance. A well-documented survey that consolidates workflow-stage risks, mitigations, and a metrics hub could serve as a practical reference for high-stakes deployments by enabling consistent comparison and release gating decisions; the inclusion of outcome and process signals (constraint violations, trace completeness, adversarial success rates) and the case study add concrete utility if coverage is representative.

major comments (1)
  1. [Abstract] Abstract (and implied introduction): the central claim that the survey consolidates 'key concepts, workflow-stage risks, and stage-targeted mitigation strategies' into a 'practical reference' for high-stakes environments rests on the assumption of representative coverage, yet no search methodology, database list, date range, inclusion/exclusion criteria, or PRISMA-style flow is described; this directly prevents verification that important works (e.g., on memory-augmented failure modes or tool-calling privacy leakage) were not omitted.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback highlighting the need for explicit documentation of our literature search process. We agree this is a valid point for a survey claiming representative coverage and will revise accordingly to strengthen verifiability.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and implied introduction): the central claim that the survey consolidates 'key concepts, workflow-stage risks, and stage-targeted mitigation strategies' into a 'practical reference' for high-stakes environments rests on the assumption of representative coverage, yet no search methodology, database list, date range, inclusion/exclusion criteria, or PRISMA-style flow is described; this directly prevents verification that important works (e.g., on memory-augmented failure modes or tool-calling privacy leakage) were not omitted.

    Authors: We acknowledge that the absence of a documented search methodology limits the ability to independently verify coverage and that this is a substantive limitation for a survey positioned as a practical reference. In the revised version, we will insert a new subsection (likely in Section 1 or a dedicated 'Survey Methodology' paragraph) that explicitly describes: (1) the primary databases and repositories searched (arXiv, Google Scholar, ACL Anthology, IEEE Xplore); (2) the keyword combinations and Boolean queries used (e.g., 'agentic AI' AND ('safety' OR 'robustness' OR 'privacy' OR 'tool use' OR 'memory')); (3) the time window (primarily 2022–early 2025, with selected foundational works); (4) inclusion criteria centered on works addressing multi-step agent workflows; and (5) a high-level PRISMA-style flow summarizing screening steps. We will also note that the survey is intentionally focused rather than exhaustive. With respect to the specific examples raised, memory-augmented failure modes are addressed in the memory-stage risk subsection and tool-calling privacy leakage appears in the tool-use privacy discussion; the added methodology section will make it easier for readers to assess whether additional references should be incorporated.

    revision: yes
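
The Boolean query and time window quoted in the rebuttal can be illustrated with a small screening sketch. The `Record` type, the naive substring matching, and the stage names in `counts` are assumptions made for illustration. The "selected foundational works" exception and the multi-step-workflow inclusion criterion are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A candidate paper; a hypothetical minimal schema."""
    title: str
    abstract: str
    year: int


# Terms from the example query:
# 'agentic AI' AND ('safety' OR 'robustness' OR 'privacy' OR 'tool use' OR 'memory')
CORE = "agentic ai"
TOPICS = ("safety", "robustness", "privacy", "tool use", "memory")


def matches_query(rec):
    """Case-insensitive substring match of the Boolean query on title + abstract."""
    text = f"{rec.title} {rec.abstract}".lower()
    return CORE in text and any(topic in text for topic in TOPICS)


def screen(records, first_year=2022, last_year=2025):
    """Apply the time window, then the query; return kept records and stage counts.

    The counts give the numbers a PRISMA-style flow would report at each step.
    """
    counts = {"identified": len(records), "in_window": 0, "matched": 0}
    kept = []
    for rec in records:
        if not (first_year <= rec.year <= last_year):
            continue
        counts["in_window"] += 1
        if matches_query(rec):
            counts["matched"] += 1
            kept.append(rec)
    return kept, counts
```

Reporting the per-stage counts alongside the kept records is what would let a reader check, as the referee asks, whether a given work was excluded by the window or by the query.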

Circularity Check

0 steps flagged

No circularity: descriptive survey with no derivations or self-referential predictions

Full rationale

This is a literature survey paper that organizes and summarizes external research on safety, robustness, privacy, and security in agentic AI systems. It identifies concepts, risks, mitigations, and metrics from prior work without any mathematical derivations, equations, fitted parameters, or claims that a result is predicted from first principles. No load-bearing steps reduce to self-definition, self-citation chains, or renaming of known results; the paper explicitly positions itself as a consolidation of external literature rather than an original derivation. The central claim of providing a practical reference rests on the breadth of cited works, not on any internal reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new mathematical models, empirical claims, or derivations; therefore no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5802 in / 1076 out tokens · 35588 ms · 2026-06-30T19:16:26.863304+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents

    cs.CR 2026-06 unverdicted novelty 5.0

    A data-centric survey finds that only information-flow control covers compositional and cross-session leakage in LLM agents and that no single benchmark tests an agent across all its data surfaces under one policy.

Reference graph

Works this paper leans on

203 extracted references · 104 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    The rise and potential of large language model based agents: a survey

    Xi Z, Chen W, Guo X, He W, Ding Y, Hong B, et al. The rise and potential of large language model based agents: a survey. Sci China Inf Sci. 2025;68(2):121101. doi: 10.1007/ s11432-024-4222-0

  2. [2]

    FinHEAR: human expertise and adaptive risk-aware tem- poral reasoning for financial decision-making

    Chen J, Zou M, Wang Z, Wang Q, Sun DD, Chi Z, et al. FinHEAR: human expertise and adaptive risk-aware tem- poral reasoning for financial decision-making. Findings of the association for computational linguistics: EMNLP 2025. Suzhou: Association for Computational Linguistics; 2025. p. 1648–72. doi: 10.18653/v1/2025.findings-emnlp.87

  3. [3]

    End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning

    Chen G, Yang S, Li C, Liu W, Luan J, Xu Z. Heterogeneous group-based reinforcement learning for llm-based multi- agent systems. arXiv; 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2506.02718

  4. [4]

    Zero-click ai vulnerability exposes mi- crosoft 365 copilot data without user interaction

    Lakshmanan R. Zero-click ai vulnerability exposes mi- crosoft 365 copilot data without user interaction. The Hacker News. 12 June 2025 [accessed on 31 December 2025]. Available from: https://thehackernews.com/2025/0 6/zero-click-ai-vulnerability-exposes.html

  5. [5]

    How microsoft defends against indirect prompt injection attacks

    Paverd A. How microsoft defends against indirect prompt injection attacks. Microsoft Security Response Center (MSRC) Blog. 29 July 2025 [accessed on 31 December 2025]. Available from: https://www.microsoft.com/en-us/ msrc/blog/2025/07/how-microsoft-defends-against-indi rect-prompt-injection-attacks

  6. [7]

    Llm01: prompt INJECTION—owasp genai security project

    OWASP. Llm01: prompt INJECTION—owasp genai security project. online. 2024 [accessed on 31 December 2025]. Available from: https://genai.owasp.org/llmrisk2023-24/l lm01-24-prompt-injection/

  7. [8]

    Limited effectiveness of llm-based data augmentation for COVID-19 misinformation stance detection

    Choi EC, Balasubramanian A, Qi J, Ferrara E. Limited effectiveness of llm-based data augmentation for COVID-19 misinformation stance detection. In Companion Proceed- ings of the ACM on Web Conference 2025, WWW ’25; New York (NY): Association for Computing Machinery; 2025. p. 934–7. ISBN 9798400713316. doi: 10.1145/3701716.371552 1

  8. [9]

    From evidence to trajectory: abductive reasoning path synthesis for training retrieval-augmented generation agents

    Li M, Qi J, Wu Y, Zhao M, Ma L, Li Y, et al. From evidence to trajectory: abductive reasoning path synthesis for training retrieval-augmented generation agents. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/25 09.23071

  9. [10]

    Voyager: an open-ended embodied agent with large lan- guage models

    Wang G, Xie Y, Jiang Y, Mandlekar A, Xiao C, Zhu Y, et al. Voyager: an open-ended embodied agent with large lan- guage models. Transactions on machine learning research (TMLR); 2023 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=ehfRiF0R3a

  10. [11]

    MemGPT: towards LLMs as operating systems

    Packer C, Wooders S, Lin K, Fang V, Patil SG, Stoica I, et al. MemGPT: towards LLMs as operating systems. The twelfth international conference on learning representa- tions (ICLR). 2024 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=LeYFkQxaAK

  11. [12]

    Agentic context engineering: evolving contexts for self-improving language models

    Zhang Q, Hu C, Upasani S, Ma B, Hong F, Kamanuru V, et al. Agentic context engineering: evolving contexts for self-improving language models. arXiv; 2025 [accessed on October 2025]. Available from: https://arxiv.org/abs/2510 .04618

  12. [13]

    Find the gap: AI, responsible agency and vulnerability

    Vallor S, Vierkant T. Find the gap: AI, responsible agency and vulnerability. Minds Mach. 2024;34(3):20. doi: 10.100 7/s11023-024-09674-0

  13. [15]

    Risk, uncertainty and AI: non- probabilistic methods for anticipating and preventing AI risks

    Gutfraind A, Bier VM. Risk, uncertainty and AI: non- probabilistic methods for anticipating and preventing AI risks. Technical report, University of Illinois. 2023 [accessed on 15 January 2026]. Available from: https://www.ideals.i llinois.edu/items/129049

  14. [16]

    Trustworthy artificial intelligence: a review

    Kaur D, Uslu S, Rittichier KJ, Durresi A. Trustworthy artificial intelligence: a review. ACM Comput Surv. 2022;55 (2):1–38. doi: 10.1145/3491209

  15. [17]

    Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    Liu Y, Yao Y, Ton J-F, Zhang X, Guo R, Cheng H, et al Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. 2024 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2308.05374

  16. [18]

    TrustLLM: trustworthiness in large language models

    Huang Y, Sun L, Wang H, Wu S, Zhang Q, Li Y, et al. TrustLLM: trustworthiness in large language models. Pro- ceedings of the 41st International Conference on Machine Learning (ICML); 21–27 July 2024; Vienna, Austria. 2024 [accessed on 15 January 2026]. Available from: https://pr oceedings.mlr.press/v235/huang24x.html

  17. [19]

    A survey on trustworthy llm agents: threats and countermea- sures

    Yu M, Meng F, Zhou X, Wang S, Mao J, Pang L, et al. A survey on trustworthy llm agents: threats and countermea- sures. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’25); 3–7 August 2025; Toronto, ON, Canada. 2025; p. 6216–26. doi: 10.1145/3711896.3736561

  18. [21]

    Pennec, P

    Ali MA, Dornaika F, Charafeddine J. Agentic AI: a com- prehensive survey of architectures, applications, and future directions. Artif Intell Rev. 2025;59(1):11. doi: 10.1007/s1 0462-025-11422-4

  19. [22]

    Llm-based agents for tool learning: a survey

    Xu W, Huang C, Gao S, Shang S. Llm-based agents for tool learning: a survey. Data Sci Eng. 2025;10(4):533–63. doi: 10.1007/s41019-025-00296-9

  20. [23]

    Artificial intelligence: a modern ap- proach

    Russell S, Norvig P. Artificial intelligence: a modern ap- proach. 4th ed. London: Pearson; 2021

  21. [24]

    Planning and acting in partially observable stochastic domains

    Kaelbling LP, Littman ML, Cassandra AR. Planning and acting in partially observable stochastic domains. Artif Intell. 1998;101(1–2):99–134. doi: 10.1016/S0004-3702(98 )00023-X

  22. [25]

    Reinforcement learning: an introduc- tion

    Sutton RS, Barto AG. Reinforcement learning: an introduc- tion. 2nd ed. Cambridge (MA): MIT Press; 2018

  23. [26]

    ReAct: synergizing reasoning and acting in language models

    Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, et al. ReAct: synergizing reasoning and acting in language models. The eleventh international conference on learn- ing representations (ICLR). 2023 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id= WE_vluYUL-X

  24. [27]

    Bernstein

    Park JS, O’Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS. Generative agents: interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Sympo- sium on User Interface Software and Technology (UIST); 29 October–1 November 2023; San Francisco, CA, USA. 2023. doi: 10.1145/3586183.3606763

  25. [28]

    Artifi- cial intelligence risk management framework (AI RMF 1.0)

    National Institute of Standards and Technology. Artifi- cial intelligence risk management framework (AI RMF 1.0). Technical report NIST AI 100-1; National Institute of Stan- dards and Technology (NIST). Voluntary framework for managing AI risks, guidance for trustworthy AI systems. 2023 [accessed on 15 January 2026]. Available from: https: //nvlpubs.nist....

  26. [29]

    Retrieval-augmented generation for knowledge- intensive nlp tasks

    Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge- intensive nlp tasks. Advances in neural information pro- cessing systems 33. Red Hook (NY): Curran Associates, Inc.;

  27. [30]

    p. 9459–74. [accessed on 15 January 2026]. Available from: https://proceedings.neurips.cc/paper/2020/file/6b4 93230205f780e1bc26945df7481e5-Paper.pdf

  28. [31]

    World models

    Ha D, Schmidhuber J. World models. Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS); 3–8 December 2018; Montréal, Canada. 2018

  29. [32]

    Between MDPs and semi- MDPs: a framework for temporal abstraction in reinforce- ment learning

    Sutton RS, Precup D, Singh S. Between MDPs and semi- MDPs: a framework for temporal abstraction in reinforce- ment learning. Artif Intell. 1999;112(1–2):181–211. doi: 10.1016/S0004-3702(99)00052-1

  30. [33]

    Toolformer: language models can teach themselves to use tools

    Schick T, Dwivedi-Yu J, Dessí R, Raileanu R, Lomeli M, Hambro E, et al. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th In- ternational Conference on Neural Information Processing Systems, NIPS ’23; 10–16 December 2023; New Orleans, LA, USA. Red Hook (NY): Curran Associates Inc.; 2023

  31. [34]

    Recode-h: a benchmark for research code development with interactive human feedback

    Miao C, Zou HP, Li Y, Chen Y, Wang Y, Wang F, et al. Recode-h: a benchmark for research code development with interactive human feedback. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2510.06186

  32. [35]

    Reflexion: language agents with verbal reinforcement learning

    Shinn N, Cassano F, Berman E, Gopinath A, Narasimhan K, Yao S. Reflexion: language agents with verbal reinforcement learning. Advances in neural information processing sys- tems 36 (NeurIPS). 2023 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=vAElhF cKW6

  33. [36]

    A Survey of Large Language Models

    Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A survey of large language models. IEEE Access. 2024 [accessed on 15 January 2026]. Available from: https://ar xiv.org/abs/2303.18223

  34. [37]

    Markov decision processes: discrete stochas- tic dynamic programming

    Puterman ML. Markov decision processes: discrete stochas- tic dynamic programming. Wiley Series in Probability and Statistics. Hoboken (NJ): John Wiley & Sons; 1994. ISBN 9780471619772

  35. [38]

    Multi-agent reinforcement learning: a selective overview of theories and algorithms

    Zhang K, Yang Z, Basar T. Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handbook of reinforcement learning and control. Cham: Springer; 2021. doi: 10.1007/978-3-030-60990-0_12

  36. [40]

    Human-Level Control through Deep Reinforce- ment Learning

    Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Belle- mare MG, et al. Human-level control through deep rein- forcement learning. Nature. 2015;518(7540):529–33. doi: 10.1038/nature14236

  37. [41]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Levine S, Kumar A, Tucker G, Fu J. Offline reinforcement learning: tutorial, review, and perspectives on open prob- lems. 2020 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2005.01643

  38. [42]

    Conservative Q-learning for offline reinforcement learning

    Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. Advances in neural information processing systems 33. Red Hook (NY): Curran Associates, Inc.; 2020. p. 1179–91. [accessed on 15 January 2026]. Available from: https://papers.nips.cc/paper/2020/hash/0d2b20618 26a5df3221116a5085a6052-Paper.pdf

  39. [43]

    Data-efficient hierarchical reinforcement learning

    Nachum O, Gu S, Lee H, Levine S. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems 31 (NeurIPS 2018). Red Hook (NY): Cur- ran Associates, Inc.; 2018. p. 3307–17. [accessed on 15 Jan- uary 2026]. Available from: http://papers.nips.cc/paper/7 591-data-efficient-hierarchical-reinforcement-learning.pdf

  40. [44]

    Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

    Chua K, Calandra R, McAllister R, Levine S. Deep rein- forcement learning in a handful of trials using probabilistic dynamics models. NeurIPS. 2018 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/1805.12114

  41. [45]

    When to trust your model: model-based policy optimization

    Janner M, Fu J, Zhang M, Levine S. When to trust your model: model-based policy optimization. NeurIPS. 2019 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/1906.08253

  42. [46]

    Constrained markov decision processes

    Altman E. Constrained markov decision processes. Boca Raton (FL): Chapman & Hall/CRC; 1999. ISBN 9780849303821

  43. [47]

    Constrained policy optimization

    Achiam J, Held D, Tamar A, Abbeel P. Constrained policy optimization. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning (ICML 2017). Vol. 70. Proceedings of Machine Learning Research. PMLR. 2017. p. 22–31. [accessed on 15 January 2026]. Available from: https://proceedings.mlr.press/v70/ achiam17a.html

  44. [48]

    A comprehensive survey on safe reinforcement learning

    García J, Fernández F. A comprehensive survey on safe reinforcement learning. J Mach Learn Res. 2015;16(1): 1437–80. doi: 10.5555/2886795

  45. [49]

    Safe reinforcement learning via shielding

    Alshiekh M, Bloem R, Ehlers R, Könighofer B, Niekum S, Topcu U. Safe reinforcement learning via shielding. Pro- ceedings of the Thirty-Second AAAI Conference on Artifi- cial Intelligence; New Orleans (LA): AAAI Press; 2018. p. 2669–78. [accessed on 15 January 2026]. Available from: ht tps://ojs.aaai.org/index.php/AAAI/article/view/11797 doi: 10.1609/aaai....

  46. [50]

    Deep reinforcement learning from human pref- erences

    Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D. Deep reinforcement learning from human pref- erences. Advances in neural information processing sys- tems 30. Red Hook (NY): Curran Associates, Inc.; 2017. p. 4299–307. [accessed on 15 January 2026]. Available from: http://papers.nips.cc/paper/7017-deep-reinforceme nt-learning-from-human-prefer...

  47. [51]

    Fine-Tuning Language Models from Human Preferences

    Ziegler DM, Stiennon N, Wu J, Brown TB, Radford A, Amodei D, et al. Fine-tuning language models from human preferences. Advances in neural information processing systems 32 (NeurIPS). 2019. [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/1909.08593

  48. [52]

    Learning to summarize from human feedback

    Stiennon N, Ouyang L, Wu J, Ziegler DM, Lowe R, Voss C, et al. Learning to summarize from human feedback. Proceedings of the 34th international conference on neural information processing systems. Red Hook (NY): Curran Associates, Inc.; 2020. p. 4302–10. [accessed on 15 January 2026]. Available from: https://proceedings.neurips.cc/paper/2020/file/1f8 9885...

  49. [53]

    Training language models to follow instructions with human feedback

    Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in neural information processing systems. Vol. 35. Red Hook (NY): Curran Associates, Inc.; 2022. p. 27730–44. [accessed on 15 January 2026]. ...

  50. [55]

    Multi-objective reinforcement learning for provably incentivising alignment with value systems

    Rodriguez-Soto M, Rădulescu R, Bistaffa F, Ricart O, Mayoral-Macau A, et al. Multi-objective reinforcement learning for provably incentivising alignment with value systems. Artif Intell. 2025;351:104460. doi: 10.1016/j.arti nt.2025.104460

  51. [56]

    An approximate embedding for designing ethical reinforcement learning environments

    Mayoral Macau A, Rodríguez-Soto M, Marchesini E, Sánchez-Fibla M, López-Sánchez M, Rodríguez-Aguilar JA, et al. An approximate embedding for designing ethical reinforcement learning environments. Proceedings of the 28th European conference on artificial intelligence (ECAI), 2025 [accessed on 15 January 2026]. Available from: https://ebooks.iospress.nl/vol...

  52. [57]

    Encoding ethics to compute value-aligned norms

    Serramia M, Rodriguez-Soto M, Lopez-Sanchez M, Rodriguez-Aguilar JA, Bistaffa F, Boddington P, et al. Encoding ethics to compute value-aligned norms. Minds Mach. 2023;33(4):761–90. doi: 10.1007/s11023-023-09649-7

  53. [58]

    Direct preference optimization: Your language model is secretly a reward model

    Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: Your language model is secretly a reward model. Thirty-seventh conference on neural information processing systems. 2023 [accessed on 15 January 2026]. Available from: https://openreview.n et/forum?id=HPuSIXJaa9

  54. [59]

    Model alignment as prospect theoretic optimization

    Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. Model alignment as prospect theoretic optimization. Pro- ceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org; 21–27 July 2024; Vienna, Austria. 2024

  55. [61]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Tobin J, Fong R, Ray A, Schneider J, Zaremba W, Abbeel P. Domain randomization for transferring deep neural networks from simulation to the real world. Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 24–28 September 2017; Van- couver, BC, Canada. 2017. p. 23–30. doi: 10.1109/IROS.201 7.8202133

  56. [62]

    Concrete Problems in AI Safety

    Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D. Concrete problems in AI safety. arXiv; 2016 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/1606.06565

  57. [63]

    Improving generalization in game agents with data augmen- tation in imitation learning

    Yadgaroff D, Sestini A, Tollmar K, Ozcelikkale A, Gisslén L. Improving generalization in game agents with data augmen- tation in imitation learning. 2023 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2309.12815

  58. [64]

    Harmbench: a standardized evaluation framework for automated red teaming and robust refusal

    Mazeika M, Phan L, Yin X, Zou A, Wang Z, Mu N, et al. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. Proceedings of the 41st International Conference on Machine Learn- ing, ICML’24. JMLR.org; 21–27 July 2024; Vienna, Austria. 2024

  59. [65]

    Out-of-distribution detection for reinforcement learn- ing agents with probabilistic dynamics models

    Haider T, Roscher K, Schmoeller da Roza F, Günnemann S. Out-of-distribution detection for reinforcement learn- ing agents with probabilistic dynamics models. Proceed- ings of the 2023 International Conference on Autonomous Agents and Multiagent Systems; Richland (SC): Interna- tional Foundation for Autonomous Agents and Multiagent Systems; 2023. p. 851–9....

  60. [66]

    Distribution- ally robust neural networks for group shifts: on the impor- tance of regularization for worst-case generalization

    Sagawa S, Koh PW, Hashimoto TB, Liang P. Distribution- ally robust neural networks for group shifts: on the impor- tance of regularization for worst-case generalization. Inter- national Conference on Learning Representations (ICLR). 2020 [accessed on 15 January 2026]. Available from: https: //openreview.net/forum?id=ryxGuJrFvS

  61. [67]

    Constitutional AI: Harmlessness from AI Feedback

    Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, et al. Constitutional AI: harmlessness from AI feedback. 2022 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/2212.08073

  62. [68]

    A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

    Kushwaha A, Ravish K, Lamba P, Kumar P. A survey of safe reinforcement learning and constrained mdps: a tech- nical survey on single-agent and multi-agent safety. 2025 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/2505.17342

  63. [69]

    Webguard: building a generalizable guardrail for web agents

    Zheng B, Liao Z, Salisbury S, Liu Z, Lin M, Zheng Q, et al. Webguard: building a generalizable guardrail for web agents. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2507.14293

  64. [70]

    Llm-agents driven automated simulation testing and analysis of small uncrewed aerial systems

    Aswath Duvvuru VS, Zhang B, Vierhauser M, Agrawal A. Llm-agents driven automated simulation testing and analysis of small uncrewed aerial systems. Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, ICSE ’25; 27 April–3 May 2025; Ottawa, ON, Canada. Hoboken (NJ): IEEE Press; 2025. p. 385–97. ISBN 9798331505691. doi: 10.1...

  65. [71]

    The temporal logic of programs

    Pnueli A. The temporal logic of programs. Proceedings of the 18th Annual Symposium on Foundations of Computer Science (sfcs 1977); 31 October–2 November 1977; Providence, RI, USA. 1977; p. 46–57. doi: 10.1109/SFCS.1977.32

  66. [72]

    Hidden technical debt in machine learning systems

    Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, et al. Hidden technical debt in machine learning systems. In Advances in neural information processing systems (NeurIPS). Cambridge (MA): MIT Press; 2015. p. 2503–11

  67. [73]

    The ML test score: a rubric for ML production readiness and technical debt reduction

    Breck E, Cai S, Nielsen E, Salib M, Sculley D. The ML test score: a rubric for ML production readiness and technical debt reduction. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data); 11–14 December 2017; Boston, MA, USA. 2017. doi: 10.1109/BigData.2017.8258038

  68. [74]

    Continuous delivery with spinnaker: fast, safe, repeatable multi-cloud deployments

    Burns E, Feldman A, Fletcher R, Lin T, Reynolds J, Sanden C, et al. Continuous delivery with spinnaker: fast, safe, repeatable multi-cloud deployments. Chapter 8: automated canary analysis. Sebastopol (CA): O’Reilly Media; 2018

  69. [75]

    Safe policy improvement with baseline bootstrapping

    Laroche R, Trichelair P, des Combes RT. Safe policy improvement with baseline bootstrapping. Proceedings of the 36th International Conference on Machine Learning (ICML); 9–15 June 2019; Long Beach, CA, USA. Vol. 97 of proceedings of machine learning research. PMLR. 2019. p. 3652–61. doi: 10.1007/978-3-030-46133-1_4. Available from: https://proceedings....

  70. [76]

    Open problems in cooperative AI

    Dafoe A, Hughes E, Bachrach Y, Collins T, McKee KR, Leibo JZ, et al. Open problems in cooperative AI. 2020 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2012.08630

  71. [77]

    Searching for Privacy Risks in LLM Agents via Simulation

    Zhang Y, Yang D. Searching for privacy risks in llm agents via simulation. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2508.10880

  72. [78]

    Beyond data privacy: new privacy risks for large language models

    Du Y, Li Z, Li N, Ding B. Beyond data privacy: new privacy risks for large language models. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2509.14278

  73. [79]

    Zero trust architecture

    Rose S, Borchert O, Mitchell S, Connelly S. Zero trust architecture. Technical report NIST special publication 800-207. Gaithersburg (MD): National Institute of Standards and Technology; 2020 [accessed on 15 January 2026]. Available from: https://csrc.nist.gov/publications/detail/sp/800-207/final

  74. [81]

    Privacy as contextual integrity

    Nissenbaum H. Privacy as contextual integrity. Wash Law Rev. 2004;79(1):119. [accessed on 15 January 2026]. Available from: https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10

  75. [82]

    Privweb: unobtrusive and content-aware privacy protection for web agents

    Zhang S, Jiang Y, Ma R, Yang Y, Xu M, Huang Z, et al. Privweb: unobtrusive and content-aware privacy protection for web agents. In CHI ’26: proceedings of the 2026 CHI conference on human factors in computing systems. New York (NY): Association for Computing Machinery; 2025

  76. [83]

    Mla-trust: benchmarking trustworthiness of multimodal llm agents in gui environments

    Yang X, Chen J, Luo J, Fang Z, Dong Y, Su H, et al. Mla-trust: benchmarking trustworthiness of multimodal llm agents in gui environments. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2506.01616

  77. [84]

    Spdx specification

    SPDX Workgroup. Spdx specification. The Linux Foundation. 2021 [accessed on 15 January 2026]. Available from: https://spdx.dev/specifications/

  78. [85]

    Sigstore: software signing for everybody

    The Sigstore Project. Sigstore: software signing for everybody. The Linux Foundation. 2022 [accessed on 15 January 2026]. Available from: https://www.sigstore.dev/

  79. [86]

    AI safety vs. AI security: demystifying the distinction and boundaries

    Lin Z, Sun H, Shroff N. AI safety vs. AI security: demystifying the distinction and boundaries. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2502.13175

  80. [87]

    Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

    Ma X, Gao Y, Wang Y, Wang R, Wang X, Sun Y, et al. Safety at scale: a comprehensive survey of large model and agent safety. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2502.05206

Showing first 80 references.