Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Dianzhi Yu; Irwin King; Jiahong Liu; Jinhu Qi; Muzhi Li; Ruoxi Jiang; Shicheng Ma; Wenqian Cui; Yiyang Zhao; Yiyi Chen

arxiv: 2605.23989 · v1 · pith:73VPCPPDnew · submitted 2026-05-17 · 💻 cs.AI · cs.CL · cs.CR

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu

Pith reviewed 2026-06-30 19:16 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CR

keywords agentic AI · trustworthy AI · safety · robustness · privacy · system security · LLM agents · benchmarks

The pith

Agentic AI systems require stage-specific safeguards for safety, robustness, privacy, and security to handle multi-step failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines trustworthy agentic AI, defined as large language models augmented with planning, tool use, memory, and long-horizon interactions that execute tasks autonomously. It focuses on two dimensions critical for high-risk uses—Safety and Robustness, and Privacy and System Security—by defining key concepts, locating risks along the agent workflow, and summarizing targeted mitigation strategies from the literature. A unified metrics-and-benchmarks hub is assembled to enable consistent evaluation using both outcome and process signals, along with scenario-to-metric guidance for deployment decisions. Other trustworthiness aspects receive only contextual mention. The work concludes with open challenges and a case study of security failures in open-source systems to serve as a practical reference.

Core claim

Agentic AI introduces new failure modes through multi-step trajectories, and the survey addresses trustworthiness by mapping risks to workflow stages within Safety and Robustness as well as Privacy and System Security, while consolidating evaluation resources into a single metrics-and-benchmarks hub that includes scenario guidance for release decisions.

What carries the argument

The unified metrics-and-benchmarks hub that consolidates outcome and process signals (such as constraint violations, trace completeness, and adversarial success rates) with scenario-to-metric guidance for evaluation.
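As a rough illustration of how outcome and process signals of this kind might be computed from logged agent runs, here is a minimal sketch. The `Step` and `Episode` record layouts and the exact metric definitions are assumptions made for this example; the page does not reproduce the survey's own definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    stage: str                        # e.g. "planning", "tool_use", "memory" (assumed labels)
    logged: bool                      # True if the step left an auditable trace record
    violated_constraint: bool = False


@dataclass
class Episode:
    steps: List[Step]
    adversarial: bool = False         # the episode was an attack attempt
    attack_succeeded: bool = False    # only meaningful when adversarial is True


def _rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def constraint_violation_rate(episodes):
    """Process signal: share of steps that broke a stated constraint."""
    return _rate(s.violated_constraint for e in episodes for s in e.steps)


def trace_completeness(episodes):
    """Process signal: share of steps that left a trace record."""
    return _rate(s.logged for e in episodes for s in e.steps)


def adversarial_success_rate(episodes):
    """Outcome signal: share of attack episodes in which the attack succeeded."""
    return _rate(e.attack_succeeded for e in episodes if e.adversarial)


# Hypothetical data: three episodes, five steps in total.
EPISODES = [
    Episode([Step("planning", True), Step("tool_use", True, violated_constraint=True)]),
    Episode([Step("tool_use", False), Step("memory", True)],
            adversarial=True, attack_succeeded=True),
    Episode([Step("planning", True)], adversarial=True),
]
```

Keeping the per-step and per-episode signals in separate functions reflects the distinction the summary draws: two runs can end with the same outcome while differing in how many constraints were violated or how much of the trajectory was traceable.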

If this is right

Workflow-stage risk mapping enables targeted interventions that reduce specific failure modes such as constraint violations.
The metrics-and-benchmarks hub supports consistent comparison across systems using both outcome and process measures.
Scenario-to-metric guidance improves release gating by linking evaluation signals directly to deployment contexts.
Attention to listed open challenges like runtime monitoring will be required to maintain trustworthiness as agents evolve.
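One way to picture scenario-to-metric release gating is a lookup table of per-scenario thresholds checked against measured values. The sketch below is hypothetical throughout: the scenario names, metric keys, and bounds are placeholders chosen for illustration and are not taken from the survey.

```python
# Hypothetical scenario-to-metric table. Each metric maps to ("max" | "min", bound).
GATES = {
    "financial_transactions": {
        "constraint_violation_rate": ("max", 0.0),
        "trace_completeness": ("min", 0.99),
    },
    "internal_research_assistant": {
        "adversarial_success_rate": ("max", 0.05),
    },
}


def release_decision(scenario, measured):
    """Return (ok, failures) for one deployment scenario.

    A metric that the scenario requires but that was never measured counts
    as a failure, so gaps in evaluation block release instead of passing silently.
    """
    failures = []
    for metric, (kind, bound) in GATES[scenario].items():
        value = measured.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
        elif kind == "max" and value > bound:
            failures.append(f"{metric}: {value} exceeds max {bound}")
        elif kind == "min" and value < bound:
            failures.append(f"{metric}: {value} below min {bound}")
    return (not failures, failures)
```

For example, `release_decision("financial_transactions", {"constraint_violation_rate": 0.02})` would report two failures under these placeholder thresholds: one for the violation rate and one for the unmeasured trace completeness.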

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The survey's workflow focus could be extended to test whether adding verification checkpoints at each stage measurably lowers adversarial success rates in controlled experiments.
The case study of open-source failures implies that public disclosure of attack traces might accelerate community development of the metrics hub.
Treating the hub as a living resource would require periodic updates tied to new agent architectures to keep the scenario guidance current.

Load-bearing premise

The key concepts, workflow-stage risks, and stage-targeted mitigation strategies drawn from existing literature are comprehensive and representative enough to guide high-stakes deployments.

What would settle it

A documented failure in a deployed agentic system whose root cause lies outside the surveyed workflow stages, risks, or mitigations and is not captured by the proposed metrics hub.

Original abstract

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a survey that structures existing work on agentic AI safety and security but gives no account of how the literature was selected.

This paper is a survey that organizes literature on safety, robustness, privacy, and system security for agentic AI. It breaks risks down by workflow stages such as planning and tool use, summarizes mitigations, and collects evaluation metrics into one hub with scenario guidance. It also includes a case study on real open-source failures and lists open challenges like runtime monitoring and the trust-utility trade-off.

What it does reasonably well is give a clear map of where problems arise in multi-step agent trajectories and why process signals like trace completeness matter alongside outcome metrics. The stage-targeted approach makes the material easier to apply than a flat list of risks would be.

The main limitation is the absence of any literature selection protocol. No search strategy, databases, date cutoffs, or inclusion criteria appear in the description, so it is not possible to verify whether the listed concepts and mitigations are representative or whether recent work on memory-augmented failures or tool-calling privacy was overlooked. That gap directly limits how much the metrics hub can be treated as a reliable reference for high-stakes use.

The paper is aimed at researchers and practitioners who need an organized starting point on trustworthiness for agentic systems rather than new technical results. A reader looking for evaluation frameworks might find the consolidated hub and scenario-to-metric suggestions practical to consult.

The work shows straightforward engagement with the cited literature and no obvious internal contradictions. It deserves peer review because the topic is timely and the structure could be strengthened with feedback, particularly on coverage and methods.

Referee Report

1 major / 0 minor

Summary. The paper is a survey of trustworthy agentic AI that focuses on two dimensions—Safety and Robustness, and Privacy and System Security—clarifying key concepts, mapping risks to stages of the agent workflow (planning, tool use, memory, long-horizon interaction), summarizing stage-targeted mitigations, consolidating evaluation into a unified metrics-and-benchmarks hub with scenario-to-metric guidance, discussing other trustworthiness aspects as context, outlining open challenges (self-evolving agents, runtime monitoring, privacy-preserving personalization, trust-utility trade-off), and presenting a case study of real-world security failures in open-source systems.

Significance. A well-documented survey that consolidates workflow-stage risks, mitigations, and a metrics hub could serve as a practical reference for high-stakes deployments by enabling consistent comparison and release gating decisions; the inclusion of outcome and process signals (constraint violations, trace completeness, adversarial success rates) and the case study add concrete utility if coverage is representative.

major comments (1)

[Abstract] Abstract (and implied introduction): the central claim that the survey consolidates 'key concepts, workflow-stage risks, and stage-targeted mitigation strategies' into a 'practical reference' for high-stakes environments rests on the assumption of representative coverage, yet no search methodology, database list, date range, inclusion/exclusion criteria, or PRISMA-style flow is described; this directly prevents verification that important works (e.g., on memory-augmented failure modes or tool-calling privacy leakage) were not omitted.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback highlighting the need for explicit documentation of our literature search process. We agree this is a valid point for a survey claiming representative coverage and will revise accordingly to strengthen verifiability.

Referee: [Abstract] Abstract (and implied introduction): the central claim that the survey consolidates 'key concepts, workflow-stage risks, and stage-targeted mitigation strategies' into a 'practical reference' for high-stakes environments rests on the assumption of representative coverage, yet no search methodology, database list, date range, inclusion/exclusion criteria, or PRISMA-style flow is described; this directly prevents verification that important works (e.g., on memory-augmented failure modes or tool-calling privacy leakage) were not omitted.

Authors: We acknowledge that the absence of a documented search methodology limits the ability to independently verify coverage and that this is a substantive limitation for a survey positioned as a practical reference. In the revised version, we will insert a new subsection (likely in Section 1 or a dedicated 'Survey Methodology' paragraph) that explicitly describes: (1) the primary databases and repositories searched (arXiv, Google Scholar, ACL Anthology, IEEE Xplore); (2) the keyword combinations and Boolean queries used (e.g., 'agentic AI' AND ('safety' OR 'robustness' OR 'privacy' OR 'tool use' OR 'memory')); (3) the time window (primarily 2022–early 2025, with selected foundational works); (4) inclusion criteria centered on works addressing multi-step agent workflows; and (5) a high-level PRISMA-style flow summarizing screening steps. We will also note that the survey is intentionally focused rather than exhaustive. With respect to the specific examples raised, memory-augmented failure modes are addressed in the memory-stage risk subsection and tool-calling privacy leakage appears in the tool-use privacy discussion; the added methodology section will make it easier for readers to assess whether additional references should be incorporated.

Revision: yes
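To make the Boolean query and screening flow in this simulated rebuttal concrete, here is a minimal sketch of how such a filter could be applied to a list of candidate records. The record fields (`title`, `abstract`, `year`), the substring matching, and the sample records are all assumptions for illustration; nothing here reflects an actual search the authors ran.

```python
# Topic terms from the example query in the rebuttal:
# 'agentic AI' AND ('safety' OR 'robustness' OR 'privacy' OR 'tool use' OR 'memory')
TOPIC_TERMS = ("safety", "robustness", "privacy", "tool use", "memory")


def matches_query(text):
    """Case-insensitive substring check for the example Boolean query."""
    t = text.lower()
    return "agentic ai" in t and any(term in t for term in TOPIC_TERMS)


def screen(records, year_from=2022, year_to=2025):
    """Count records surviving each step, in the spirit of a PRISMA-style flow."""
    matched = [r for r in records if matches_query(r["title"] + " " + r["abstract"])]
    included = [r for r in matched if year_from <= r["year"] <= year_to]
    return {
        "identified": len(records),
        "matched_query": len(matched),
        "included": len(included),
    }


# Hypothetical candidate records, not real papers.
RECORDS = [
    {"title": "Agentic AI safety benchmarks", "abstract": "", "year": 2024},
    {"title": "Agentic AI for tool use", "abstract": "", "year": 2021},
    {"title": "Image classification robustness", "abstract": "", "year": 2023},
]
```

Reporting the count at each step, rather than only the final set, is what lets a reader judge how much the query and the date window each narrowed the pool. A real protocol would also need a documented exception path for the "selected foundational works" outside the window.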

Circularity Check

0 steps flagged

No circularity: descriptive survey with no derivations or self-referential predictions

This is a literature survey paper that organizes and summarizes external research on safety, robustness, privacy, and security in agentic AI systems. It identifies concepts, risks, mitigations, and metrics from prior work without any mathematical derivations, equations, fitted parameters, or claims that a result is predicted from first principles. No load-bearing steps reduce to self-definition, self-citation chains, or renaming of known results; the paper explicitly positions itself as a consolidation of external literature rather than an original derivation. The central claim of providing a practical reference rests on the breadth of cited works, not on any internal reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new mathematical models, empirical claims, or derivations; therefore no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5802 in / 1076 out tokens · 35588 ms · 2026-06-30T19:16:26.863304+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents
cs.CR · 2026-06 · unverdicted · novelty 5.0

A data-centric survey finds that only information-flow control covers compositional and cross-session leakage in LLM agents and that no single benchmark tests an agent across all its data surfaces under one policy.

Reference graph

Works this paper leans on

203 extracted references · 104 canonical work pages · cited by 1 Pith paper · 17 internal anchors

[1]

The rise and potential of large language model based agents: a survey

Xi Z, Chen W, Guo X, He W, Ding Y, Hong B, et al. The rise and potential of large language model based agents: a survey. Sci China Inf Sci. 2025;68(2):121101. doi: 10.1007/ s11432-024-4222-0

2025
[2]

FinHEAR: human expertise and adaptive risk-aware tem- poral reasoning for financial decision-making

Chen J, Zou M, Wang Z, Wang Q, Sun DD, Chi Z, et al. FinHEAR: human expertise and adaptive risk-aware tem- poral reasoning for financial decision-making. Findings of the association for computational linguistics: EMNLP 2025. Suzhou: Association for Computational Linguistics; 2025. p. 1648–72. doi: 10.18653/v1/2025.findings-emnlp.87

work page doi:10.18653/v1/2025.findings-emnlp.87 2025
[3]

End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning

Chen G, Yang S, Li C, Liu W, Luan J, Xu Z. Heterogeneous group-based reinforcement learning for llm-based multi- agent systems. arXiv; 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2506.02718

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Zero-click ai vulnerability exposes mi- crosoft 365 copilot data without user interaction

Lakshmanan R. Zero-click ai vulnerability exposes mi- crosoft 365 copilot data without user interaction. The Hacker News. 12 June 2025 [accessed on 31 December 2025]. Available from: https://thehackernews.com/2025/0 6/zero-click-ai-vulnerability-exposes.html

2025
[5]

How microsoft defends against indirect prompt injection attacks

Paverd A. How microsoft defends against indirect prompt injection attacks. Microsoft Security Response Center (MSRC) Blog. 29 July 2025 [accessed on 31 December 2025]. Available from: https://www.microsoft.com/en-us/ msrc/blog/2025/07/how-microsoft-defends-against-indi rect-prompt-injection-attacks

2025
[7]

Llm01: prompt INJECTION—owasp genai security project

OWASP. Llm01: prompt INJECTION—owasp genai security project. online. 2024 [accessed on 31 December 2025]. Available from: https://genai.owasp.org/llmrisk2023-24/l lm01-24-prompt-injection/

2024
[8]

Limited effectiveness of llm-based data augmentation for COVID-19 misinformation stance detection

Choi EC, Balasubramanian A, Qi J, Ferrara E. Limited effectiveness of llm-based data augmentation for COVID-19 misinformation stance detection. In Companion Proceed- ings of the ACM on Web Conference 2025, WWW ’25; New York (NY): Association for Computing Machinery; 2025. p. 934–7. ISBN 9798400713316. doi: 10.1145/3701716.371552 1

work page doi:10.1145/3701716.371552 2025
[9]

From evidence to trajectory: abductive reasoning path synthesis for training retrieval-augmented generation agents

Li M, Qi J, Wu Y, Zhao M, Ma L, Li Y, et al. From evidence to trajectory: abductive reasoning path synthesis for training retrieval-augmented generation agents. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/25 09.23071

2025
[10]

Voyager: an open-ended embodied agent with large lan- guage models

Wang G, Xie Y, Jiang Y, Mandlekar A, Xiao C, Zhu Y, et al. Voyager: an open-ended embodied agent with large lan- guage models. Transactions on machine learning research (TMLR); 2023 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=ehfRiF0R3a

2023
[11]

MemGPT: towards LLMs as operating systems

Packer C, Wooders S, Lin K, Fang V, Patil SG, Stoica I, et al. MemGPT: towards LLMs as operating systems. The twelfth international conference on learning representa- tions (ICLR). 2024 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=LeYFkQxaAK

2024
[12]

Agentic context engineering: evolving contexts for self-improving language models

Zhang Q, Hu C, Upasani S, Ma B, Hong F, Kamanuru V, et al. Agentic context engineering: evolving contexts for self-improving language models. arXiv; 2025 [accessed on October 2025]. Available from: https://arxiv.org/abs/2510 .04618

2025
[13]

Find the gap: AI, responsible agency and vulnerability

Vallor S, Vierkant T. Find the gap: AI, responsible agency and vulnerability. Minds Mach. 2024;34(3):20. doi: 10.100 7/s11023-024-09674-0

2024
[15]

Risk, uncertainty and AI: non- probabilistic methods for anticipating and preventing AI risks

Gutfraind A, Bier VM. Risk, uncertainty and AI: non- probabilistic methods for anticipating and preventing AI risks. Technical report, University of Illinois. 2023 [accessed on 15 January 2026]. Available from: https://www.ideals.i llinois.edu/items/129049

2023
[16]

Trustworthy artificial intelligence: a review

Kaur D, Uslu S, Rittichier KJ, Durresi A. Trustworthy artificial intelligence: a review. ACM Comput Surv. 2022;55 (2):1–38. doi: 10.1145/3491209

work page doi:10.1145/3491209 2022
[17]

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Liu Y, Yao Y, Ton J-F, Zhang X, Guo R, Cheng H, et al Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. 2024 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2308.05374

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

TrustLLM: trustworthiness in large language models

Huang Y, Sun L, Wang H, Wu S, Zhang Q, Li Y, et al. TrustLLM: trustworthiness in large language models. Pro- ceedings of the 41st International Conference on Machine Learning (ICML); 21–27 July 2024; Vienna, Austria. 2024 [accessed on 15 January 2026]. Available from: https://pr oceedings.mlr.press/v235/huang24x.html

2024
[19]

A survey on trustworthy llm agents: threats and countermea- sures

Yu M, Meng F, Zhou X, Wang S, Mao J, Pang L, et al. A survey on trustworthy llm agents: threats and countermea- sures. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’25); 3–7 August 2025; Toronto, ON, Canada. 2025; p. 6216–26. doi: 10.1145/3711896.3736561

work page doi:10.1145/3711896.3736561 2025
[21]

Pennec, P

Ali MA, Dornaika F, Charafeddine J. Agentic AI: a com- prehensive survey of architectures, applications, and future directions. Artif Intell Rev. 2025;59(1):11. doi: 10.1007/s1 0462-025-11422-4

work page doi:10.1007/s1 2025
[22]

Llm-based agents for tool learning: a survey

Xu W, Huang C, Gao S, Shang S. Llm-based agents for tool learning: a survey. Data Sci Eng. 2025;10(4):533–63. doi: 10.1007/s41019-025-00296-9

work page doi:10.1007/s41019-025-00296-9 2025
[23]

Artificial intelligence: a modern ap- proach

Russell S, Norvig P. Artificial intelligence: a modern ap- proach. 4th ed. London: Pearson; 2021

2021
[24]

Planning and acting in partially observable stochastic domains

Kaelbling LP, Littman ML, Cassandra AR. Planning and acting in partially observable stochastic domains. Artif Intell. 1998;101(1–2):99–134. doi: 10.1016/S0004-3702(98 )00023-X

work page doi:10.1016/s0004-3702(98 1998
[25]

Reinforcement learning: an introduc- tion

Sutton RS, Barto AG. Reinforcement learning: an introduc- tion. 2nd ed. Cambridge (MA): MIT Press; 2018

2018
[26]

ReAct: synergizing reasoning and acting in language models

Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, et al. ReAct: synergizing reasoning and acting in language models. The eleventh international conference on learn- ing representations (ICLR). 2023 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id= WE_vluYUL-X

2023
[27]

Bernstein

Park JS, O’Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS. Generative agents: interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Sympo- sium on User Interface Software and Technology (UIST); 29 October–1 November 2023; San Francisco, CA, USA. 2023. doi: 10.1145/3586183.3606763

work page doi:10.1145/3586183.3606763 2023
[28]

Artifi- cial intelligence risk management framework (AI RMF 1.0)

National Institute of Standards and Technology. Artifi- cial intelligence risk management framework (AI RMF 1.0). Technical report NIST AI 100-1; National Institute of Stan- dards and Technology (NIST). Voluntary framework for managing AI risks, guidance for trustworthy AI systems. 2023 [accessed on 15 January 2026]. Available from: https: //nvlpubs.nist....

2023
[29]

Retrieval-augmented generation for knowledge- intensive nlp tasks

Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge- intensive nlp tasks. Advances in neural information pro- cessing systems 33. Red Hook (NY): Curran Associates, Inc.;
[30]

p. 9459–74. [accessed on 15 January 2026]. Available from: https://proceedings.neurips.cc/paper/2020/file/6b4 93230205f780e1bc26945df7481e5-Paper.pdf

2026
[31]

World models

Ha D, Schmidhuber J. World models. Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS); 3–8 December 2018; Montréal, Canada. 2018

2018
[32]

Between MDPs and semi- MDPs: a framework for temporal abstraction in reinforce- ment learning

Sutton RS, Precup D, Singh S. Between MDPs and semi- MDPs: a framework for temporal abstraction in reinforce- ment learning. Artif Intell. 1999;112(1–2):181–211. doi: 10.1016/S0004-3702(99)00052-1

work page doi:10.1016/s0004-3702(99)00052-1 1999
[33]

Toolformer: language models can teach themselves to use tools

Schick T, Dwivedi-Yu J, Dessí R, Raileanu R, Lomeli M, Hambro E, et al. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th In- ternational Conference on Neural Information Processing Systems, NIPS ’23; 10–16 December 2023; New Orleans, LA, USA. Red Hook (NY): Curran Associates Inc.; 2023

2023
[34]

Recode-h: a benchmark for research code development with interactive human feedback

Miao C, Zou HP, Li Y, Chen Y, Wang Y, Wang F, et al. Recode-h: a benchmark for research code development with interactive human feedback. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2510.06186

work page arXiv 2025
[35]

Reflexion: language agents with verbal reinforcement learning

Shinn N, Cassano F, Berman E, Gopinath A, Narasimhan K, Yao S. Reflexion: language agents with verbal reinforcement learning. Advances in neural information processing sys- tems 36 (NeurIPS). 2023 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=vAElhF cKW6

2023
[36]

A Survey of Large Language Models

Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A survey of large language models. IEEE Access. 2024 [accessed on 15 January 2026]. Available from: https://ar xiv.org/abs/2303.18223

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Markov decision processes: discrete stochas- tic dynamic programming

Puterman ML. Markov decision processes: discrete stochas- tic dynamic programming. Wiley Series in Probability and Statistics. Hoboken (NJ): John Wiley & Sons; 1994. ISBN 9780471619772

1994
[38]

Multi-agent reinforcement learning: a selective overview of theories and algorithms

Zhang K, Yang Z, Basar T. Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handbook of reinforcement learning and control. Cham: Springer; 2021. doi: 10.1007/978-3-030-60990-0_12

work page doi:10.1007/978-3-030-60990-0_12 2021
[40]

Human-Level Control through Deep Reinforce- ment Learning

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Belle- mare MG, et al. Human-level control through deep rein- forcement learning. Nature. 2015;518(7540):529–33. doi: 10.1038/nature14236

work page doi:10.1038/nature14236 2015
[41]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Levine S, Kumar A, Tucker G, Fu J. Offline reinforcement learning: tutorial, review, and perspectives on open prob- lems. 2020 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2005.01643

work page internal anchor Pith review Pith/arXiv arXiv 2020
[42]

Conservative Q-learning for offline reinforcement learning

Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. Advances in neural information processing systems 33. Red Hook (NY): Curran Associates, Inc.; 2020. p. 1179–91. [accessed on 15 January 2026]. Available from: https://papers.nips.cc/paper/2020/hash/0d2b20618 26a5df3221116a5085a6052-Paper.pdf

2020
[43]

Data-efficient hierarchical reinforcement learning

Nachum O, Gu S, Lee H, Levine S. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems 31 (NeurIPS 2018). Red Hook (NY): Cur- ran Associates, Inc.; 2018. p. 3307–17. [accessed on 15 Jan- uary 2026]. Available from: http://papers.nips.cc/paper/7 591-data-efficient-hierarchical-reinforcement-learning.pdf

2018
[44]

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

Chua K, Calandra R, McAllister R, Levine S. Deep rein- forcement learning in a handful of trials using probabilistic dynamics models. NeurIPS. 2018 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/1805.12114

work page internal anchor Pith review Pith/arXiv arXiv 2018
[45]

When to trust your model: model-based policy optimization

Janner M, Fu J, Zhang M, Levine S. When to trust your model: model-based policy optimization. NeurIPS. 2019 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/1906.08253

work page arXiv 2019
[46]

Constrained markov decision processes

Altman E. Constrained markov decision processes. Boca Raton (FL): Chapman & Hall/CRC; 1999. ISBN 9780849303821

1999
[47]

Constrained policy optimization

Achiam J, Held D, Tamar A, Abbeel P. Constrained policy optimization. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning (ICML 2017). Vol. 70. Proceedings of Machine Learning Research. PMLR. 2017. p. 22–31. [accessed on 15 January 2026]. Available from: https://proceedings.mlr.press/v70/ achiam17a.html

2017
[48]

A comprehensive survey on safe reinforcement learning

García J, Fernández F. A comprehensive survey on safe reinforcement learning. J Mach Learn Res. 2015;16(1): 1437–80. doi: 10.5555/2886795

work page doi:10.5555/2886795 2015
[49]

Safe reinforcement learning via shielding

Alshiekh M, Bloem R, Ehlers R, Könighofer B, Niekum S, Topcu U. Safe reinforcement learning via shielding. Pro- ceedings of the Thirty-Second AAAI Conference on Artifi- cial Intelligence; New Orleans (LA): AAAI Press; 2018. p. 2669–78. [accessed on 15 January 2026]. Available from: ht tps://ojs.aaai.org/index.php/AAAI/article/view/11797 doi: 10.1609/aaai....

work page doi:10.1609/aaai.v32i1.11797 2018
[50]

Deep reinforcement learning from human pref- erences

Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D. Deep reinforcement learning from human pref- erences. Advances in neural information processing sys- tems 30. Red Hook (NY): Curran Associates, Inc.; 2017. p. 4299–307. [accessed on 15 January 2026]. Available from: http://papers.nips.cc/paper/7017-deep-reinforceme nt-learning-from-human-prefer...

2017
[51]

Fine-Tuning Language Models from Human Preferences

Ziegler DM, Stiennon N, Wu J, Brown TB, Radford A, Amodei D, et al. Fine-tuning language models from human preferences. Advances in neural information processing systems 32 (NeurIPS). 2019. [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/1909.08593

work page internal anchor Pith review Pith/arXiv arXiv 2019
[52]

Learning to summarize from human feedback

Stiennon N, Ouyang L, Wu J, Ziegler DM, Lowe R, Voss C, et al. Learning to summarize from human feedback. Proceedings of the 34th international conference on neural information processing systems. Red Hook (NY): Curran Associates, Inc.; 2020. p. 4302–10. [accessed on 15 January 2026]. Available from: https://proceedings.neurips.cc/paper/2020/file/1f8 9885...

2020
[53]

Training language models to follow instructions with human feedback

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in neural information processing systems. Vol. 35. Red Hook (NY): Curran Associates, Inc.; 2022. p. 27730–44. [accessed on 15 January 2026]. ...

2022
[55]

Multi-objective reinforcement learning for provably incentivising alignment with value systems

Rodriguez-Soto M, Rădulescu R, Bistaffa F, Ricart O, Mayoral-Macau A, et al. Multi-objective reinforcement learning for provably incentivising alignment with value systems. Artif Intell. 2025;351:104460. doi: 10.1016/j.arti nt.2025.104460

work page doi:10.1016/j.arti 2025
[56]

An approximate embedding for designing ethical reinforcement learning environments

Mayoral Macau A, Rodríguez-Soto M, Marchesini E, Sánchez-Fibla M, López-Sánchez M, Rodríguez-Aguilar JA, et al. An approximate embedding for designing ethical reinforcement learning environments. Proceedings of the 28th European conference on artificial intelligence (ECAI), 2025 [accessed on 15 January 2026]. Available from: https://ebooks.iospress.nl/vol...

2025
[57]

Encoding ethics to compute value-aligned norms

Serramia M, Rodriguez-Soto M, Lopez-Sanchez M, Rodriguez-Aguilar JA, Bistaffa F, Boddington P, et al. Encoding ethics to compute value-aligned norms. Minds Mach. 2023;33(4):761–90. doi: 10.1007/s11023-023-09649-7

work page doi:10.1007/s11023-023-09649-7 2023
[58]

Direct preference optimization: Your language model is secretly a reward model

Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: Your language model is secretly a reward model. Thirty-seventh conference on neural information processing systems. 2023 [accessed on 15 January 2026]. Available from: https://openreview.n et/forum?id=HPuSIXJaa9

2023
[59]

Model alignment as prospect theoretic optimization

Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. Model alignment as prospect theoretic optimization. Pro- ceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org; 21–27 July 2024; Vienna, Austria. 2024

2024
[61]

Domain randomization for transferring deep neural networks from simulation to the real world

Tobin J, Fong R, Ray A, Schneider J, Zaremba W, Abbeel P. Domain randomization for transferring deep neural networks from simulation to the real world. Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 24–28 September 2017; Vancouver, BC, Canada. 2017. p. 23–30. doi: 10.1109/IROS.2017.8202133

2017
[62]

Concrete Problems in AI Safety

Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D. Concrete problems in AI safety. arXiv; 2016 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/1606.06565

2016
[63]

Improving generalization in game agents with data augmentation in imitation learning

Yadgaroff D, Sestini A, Tollmar K, Ozcelikkale A, Gisslén L. Improving generalization in game agents with data augmentation in imitation learning. 2023 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2309.12815

2023
[64]

Harmbench: a standardized evaluation framework for automated red teaming and robust refusal

Mazeika M, Phan L, Yin X, Zou A, Wang Z, Mu N, et al. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org; 21–27 July 2024; Vienna, Austria. 2024

2024
[65]

Out-of-distribution detection for reinforcement learning agents with probabilistic dynamics models

Haider T, Roscher K, Schmoeller da Roza F, Günnemann S. Out-of-distribution detection for reinforcement learning agents with probabilistic dynamics models. Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems; Richland (SC): International Foundation for Autonomous Agents and Multiagent Systems; 2023. p. 851–9....

2023
[66]

Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization

Sagawa S, Koh PW, Hashimoto TB, Liang P. Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. International Conference on Learning Representations (ICLR). 2020 [accessed on 15 January 2026]. Available from: https://openreview.net/forum?id=ryxGuJrFvS

2020
[67]

Constitutional AI: Harmlessness from AI Feedback

Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, et al. Constitutional AI: harmlessness from AI feedback. 2022 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2212.08073

2022
[68]

A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Kushwaha A, Ravish K, Lamba P, Kumar P. A survey of safe reinforcement learning and constrained mdps: a technical survey on single-agent and multi-agent safety. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2505.17342

2025
[69]

Webguard: building a generalizable guardrail for web agents

Zheng B, Liao Z, Salisbury S, Liu Z, Lin M, Zheng Q, et al. Webguard: building a generalizable guardrail for web agents. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2507.14293

2025
[70]

Llm-agents driven automated simulation testing and analysis of small uncrewed aerial systems

Aswath Duvvuru VS, Zhang B, Vierhauser M, Agrawal A. Llm-agents driven automated simulation testing and analysis of small uncrewed aerial systems. Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, ICSE ’25; 27 April–3 May 2025; Ottawa, ON, Canada. Hoboken (NJ): IEEE Press; 2025. p. 385–97. ISBN 9798331505691. doi: 10.1109/icse55347.2025.00223

2025
[71]

The temporal logic of programs

Pnueli A. The temporal logic of programs. Proceedings of the 18th Annual Symposium on Foundations of Computer Science (sfcs 1977); 31 October–2 November 1977; Providence, RI, USA. 1977; p. 46–57. doi: 10.1109/SFCS.1977.32

1977
[72]

Hidden technical debt in machine learning systems

Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, et al. Hidden technical debt in machine learning systems. In Advances in neural information processing systems (NeurIPS). Cambridge (MA): MIT Press; 2015. p. 2503–11

2015
[73]

The ML test score: a rubric for ML production readiness and technical debt reduction

Breck E, Cai S, Nielsen E, Salib M, Sculley D. The ML test score: a rubric for ML production readiness and technical debt reduction. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data); 11–14 December 2017; Boston, MA, USA. 2017. doi: 10.1109/BigData.2017.8258038

2017
[74]

Continuous delivery with spinnaker: fast, safe, repeatable multi-cloud deployments

Burns E, Feldman A, Fletcher R, Lin T, Reynolds J, Sanden C, et al. Continuous delivery with spinnaker: fast, safe, repeatable multi-cloud deployments. Chapter 8: automated canary analysis. Sebastopol (CA): O’Reilly Media; 2018

2018
[75]

Safe policy improvement with baseline bootstrapping

Laroche R, Trichelair P, des Combes RT. Safe policy improvement with baseline bootstrapping. Proceedings of the 36th International Conference on Machine Learning (ICML); 9–15 June 2019; Long Beach, CA, USA. Vol. 97 of proceedings of machine learning research. PMLR. 2019. p. 3652–61. doi: 10.1007/978-3-030-46133-1_4. Available from: https://proceedings....

2019
[76]

Open problems in cooperative AI

Dafoe A, Hughes E, Bachrach Y, Collins T, McKee KR, Leibo JZ, et al. Open problems in cooperative AI. 2020 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2012.08630

2020
[77]

Searching for Privacy Risks in LLM Agents via Simulation

Zhang Y, Yang D. Searching for privacy risks in llm agents via simulation. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2508.10880

2025
[78]

Beyond data privacy: new privacy risks for large language models

Du Y, Li Z, Li N, Ding B. Beyond data privacy: new privacy risks for large language models. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2509.14278

2025
[79]

Zero trust architecture

Rose S, Borchert O, Mitchell S, Connelly S. Zero trust architecture. Technical report NIST special publication 800-207. Gaithersburg (MD): National Institute of Standards and Technology; 2020 [accessed on 15 January 2026]. Available from: https://csrc.nist.gov/publications/detail/sp/800-207/final

2020
[81]

Privacy as contextual integrity

Nissenbaum H. Privacy as contextual integrity. Wash Law Rev. 2004;79(1):119. [accessed on 15 January 2026]. Available from: https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10

2004
[82]

Privweb: unobtrusive and content-aware privacy protection for web agents

Zhang S, Jiang Y, Ma R, Yang Y, Xu M, Huang Z, et al. Privweb: unobtrusive and content-aware privacy protection for web agents. In CHI ’26: proceedings of the 2026 CHI conference on human factors in computing systems. New York (NY): Association for Computing Machinery; 2025

2026
[83]

Mla-trust: benchmarking trustworthiness of multimodal llm agents in gui environments

Yang X, Chen J, Luo J, Fang Z, Dong Y, Su H, et al. Mla-trust: benchmarking trustworthiness of multimodal llm agents in gui environments. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2506.01616

2025
[84]

Spdx specification

SPDX Workgroup. Spdx specification. The Linux Foundation. 2021 [accessed on 15 January 2026]. Available from: https://spdx.dev/specifications/

2021
[85]

Sigstore: software signing for everybody

The Sigstore Project. Sigstore: software signing for everybody. The Linux Foundation. 2022 [accessed on 15 January 2026]. Available from: https://www.sigstore.dev/

2022
[86]

AI safety vs. AI security: demystifying the distinction and boundaries

Lin Z, Sun H, Shroff N. AI safety vs. AI security: demystifying the distinction and boundaries. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2502.13175

2025
[87]

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

Ma X, Gao Y, Wang Y, Wang R, Wang X, Sun Y, et al. Safety at scale: a comprehensive survey of large model and agent safety. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2502.05206

2025


[50] [55]

Multi-objective reinforcement learning for provably incentivising alignment with value systems

Rodriguez-Soto M, Rădulescu R, Bistaffa F, Ricart O, Mayoral-Macau A, et al. Multi-objective reinforcement learning for provably incentivising alignment with value systems. Artif Intell. 2025;351:104460. doi: 10.1016/j.arti nt.2025.104460

work page doi:10.1016/j.arti 2025

[51] [56]

An approximate embedding for designing ethical reinforcement learning environments

Mayoral Macau A, Rodríguez-Soto M, Marchesini E, Sánchez-Fibla M, López-Sánchez M, Rodríguez-Aguilar JA, et al. An approximate embedding for designing ethical reinforcement learning environments. Proceedings of the 28th European conference on artificial intelligence (ECAI), 2025 [accessed on 15 January 2026]. Available from: https://ebooks.iospress.nl/vol...

2025

[52] [57]

Encoding ethics to compute value-aligned norms

Serramia M, Rodriguez-Soto M, Lopez-Sanchez M, Rodriguez-Aguilar JA, Bistaffa F, Boddington P, et al. Encoding ethics to compute value-aligned norms. Minds Mach. 2023;33(4):761–90. doi: 10.1007/s11023-023-09649-7

work page doi:10.1007/s11023-023-09649-7 2023

[53] [58]

Direct preference optimization: Your language model is secretly a reward model

Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: Your language model is secretly a reward model. Thirty-seventh conference on neural information processing systems. 2023 [accessed on 15 January 2026]. Available from: https://openreview.n et/forum?id=HPuSIXJaa9

2023

[54] [59]

Model alignment as prospect theoretic optimization

Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. Model alignment as prospect theoretic optimization. Pro- ceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org; 21–27 July 2024; Vienna, Austria. 2024

2024

[55] [61]

Domain randomization for transferring deep neural networks from simulation to the real world

Tobin J, Fong R, Ray A, Schneider J, Zaremba W, Abbeel P. Domain randomization for transferring deep neural networks from simulation to the real world. Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 24–28 September 2017; Van- couver, BC, Canada. 2017. p. 23–30. doi: 10.1109/IROS.201 7.8202133

work page doi:10.1109/iros.201 2017

[56] [62]

Concrete Problems in AI Safety

Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D. Concrete problems in AI safety. arXiv; 2016 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016

[57] [63]

Improving generalization in game agents with data augmen- tation in imitation learning

Yadgaroff D, Sestini A, Tollmar K, Ozcelikkale A, Gisslén L. Improving generalization in game agents with data augmen- tation in imitation learning. 2023 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2309.12815

work page arXiv 2023

[58] [64]

Harmbench: a standardized evaluation framework for automated red teaming and robust refusal

Mazeika M, Phan L, Yin X, Zou A, Wang Z, Mu N, et al. Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. Proceedings of the 41st International Conference on Machine Learn- ing, ICML’24. JMLR.org; 21–27 July 2024; Vienna, Austria. 2024

2024

[59] [65]

Out-of-distribution detection for reinforcement learn- ing agents with probabilistic dynamics models

Haider T, Roscher K, Schmoeller da Roza F, Günnemann S. Out-of-distribution detection for reinforcement learn- ing agents with probabilistic dynamics models. Proceed- ings of the 2023 International Conference on Autonomous Agents and Multiagent Systems; Richland (SC): Interna- tional Foundation for Autonomous Agents and Multiagent Systems; 2023. p. 851–9....

2023

[60] [66]

Distribution- ally robust neural networks for group shifts: on the impor- tance of regularization for worst-case generalization

Sagawa S, Koh PW, Hashimoto TB, Liang P. Distribution- ally robust neural networks for group shifts: on the impor- tance of regularization for worst-case generalization. Inter- national Conference on Learning Representations (ICLR). 2020 [accessed on 15 January 2026]. Available from: https: //openreview.net/forum?id=ryxGuJrFvS

2020

[61] [67]

Constitutional AI: Harmlessness from AI Feedback

Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, et al. Constitutional AI: harmlessness from AI feedback. 2022 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022

[62] [68]

A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Kushwaha A, Ravish K, Lamba P, Kumar P. A survey of safe reinforcement learning and constrained mdps: a tech- nical survey on single-agent and multi-agent safety. 2025 [accessed on 15 January 2026]. Available from: https://arxi v.org/abs/2505.17342

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [69]

Webguard: building a generalizable guardrail for web agents

Zheng B, Liao Z, Salisbury S, Liu Z, Lin M, Zheng Q, et al. Webguard: building a generalizable guardrail for web agents. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2507.14293

work page arXiv 2025

[64] [70]

Llm-agents driven automated simulation testing and anal- ysis of small uncrewed aerial systems

Aswath Duvvuru VS, Zhang B, Vierhauser M, Agrawal A. Llm-agents driven automated simulation testing and anal- ysis of small uncrewed aerial systems. Proceedings of the IEEE/ACM 47th International Conference on Software En- gineering, ICSE ’25; 27 April–3 May 2025; Ottawa, ON, Canada. Hoboken (NJ): IEEE Press; 2025. p. 385–97. ISBN 9798331505691. doi: 10.1...

work page doi:10.1109/icse55347.2025.00223 2025

[65] [71]

The temporal logic of programs

Pnueli A. The temporal logic of programs. Proceedings of the 18th Annual Symposium on Foundations of Computer Sci- ence (sfcs 1977); 31 October–2 November 1977; Providence, RI, USA. 1977; p. 46–57. doi: 10.1109/SFCS.1977.32

work page doi:10.1109/sfcs.1977.32 1977

[66] [72]

Hidden technical debt in machine learning sys- tems

Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, et al. Hidden technical debt in machine learning sys- tems. In Advances in neural information processing systems (NeurIPS). Cambridge (MA): MIT Press; 2015. p. 2503–11

2015

[67] [73]

The ML test score: a rubric for ML production readiness and technical debt reduction

Breck E, Cai S, Nielsen E, Salib M, Sculley D. The ML test score: a rubric for ML production readiness and technical debt reduction. Proceedings of the 2017 IEEE International Conference on Big Data (Big Data); 11–14 December 2017; Boston, MA, USA. 2017. doi: 10.1109/BigData.2017.82580 38

2017

[68] [74]

Continuous delivery with spinnaker: fast, safe, repeatable multi-cloud deployments

Burns E, Feldman A, Fletcher R, Lin T, Reynolds J, Sanden C, et al. Continuous delivery with spinnaker: fast, safe, repeatable multi-cloud deployments. Chapter 8: automated canary analysis. Sebastopol (CA): O’Reilly Media; 2018

2018

[69] [75]

Safe policy improvement with baseline bootstrapping

Laroche R, Trichelair P, des Combes RT. Safe policy improvement with baseline bootstrapping. Proceedings of the 36th International Conference on Machine Learning (ICML); 9–15 June 2019; Long Beach, CA, USA. Vol. 97 of proceedings of machine learning research. PMLR. 2019. p. 3652–61. doi: 10.1007/978-3-030-46133-1_4. Available from: https://proceedings....

2019

[70] [76]

Open problems in cooperative AI

Dafoe A, Hughes E, Bachrach Y, Collins T, McKee KR, Leibo JZ, et al. Open problems in cooperative AI. 2020 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2012.08630

2020

[71] [77]

Searching for Privacy Risks in LLM Agents via Simulation

Zhang Y, Yang D. Searching for privacy risks in llm agents via simulation. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2508.10880

2025

[72] [78]

Beyond data privacy: new privacy risks for large language models

Du Y, Li Z, Li N, Ding B. Beyond data privacy: new privacy risks for large language models. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2509.14278

2025

[73] [79]

Zero trust architecture

Rose S, Borchert O, Mitchell S, Connelly S. Zero trust architecture. Technical report NIST special publication 800-207. Gaithersburg (MD): National Institute of Standards and Technology; 2020 [accessed on 15 January 2026]. Available from: https://csrc.nist.gov/publications/detail/sp/800-207/final

2020

[74] [81]

Privacy as contextual integrity

Nissenbaum H. Privacy as contextual integrity. Wash Law Rev. 2004;79(1):119. [accessed on 15 January 2026]. Available from: https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10

2004

[75] [82]

Privweb: unobtrusive and content-aware privacy protection for web agents

Zhang S, Jiang Y, Ma R, Yang Y, Xu M, Huang Z, et al. Privweb: unobtrusive and content-aware privacy protection for web agents. In CHI ’26: proceedings of the 2026 CHI conference on human factors in computing systems. New York (NY): Association for Computing Machinery; 2025

2026

[76] [83]

Mla-trust: benchmarking trustworthiness of multimodal llm agents in gui environments

Yang X, Chen J, Luo J, Fang Z, Dong Y, Su H, et al. Mla-trust: benchmarking trustworthiness of multimodal llm agents in gui environments. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2506.01616

2025

[77] [84]

Spdx specification

SPDX Workgroup. Spdx specification. The Linux Foundation. 2021 [accessed on 15 January 2026]. Available from: https://spdx.dev/specifications/

2021

[78] [85]

Sigstore: software signing for everybody

The Sigstore Project. Sigstore: software signing for everybody. The Linux Foundation. 2022 [accessed on 15 January 2026]. Available from: https://www.sigstore.dev/

2022

[79] [86]

AI safety vs. AI security: demystifying the distinction and boundaries

Lin Z, Sun H, Shroff N. AI safety vs. AI security: demystifying the distinction and boundaries. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2502.13175

2025

[80] [87]

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

Ma X, Gao Y, Wang Y, Wang R, Wang X, Sun Y, et al. Safety at scale: a comprehensive survey of large model and agent safety. 2025 [accessed on 15 January 2026]. Available from: https://arxiv.org/abs/2502.05206

2025