Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

Changkun Ou

arxiv: 2605.19151 · v1 · pith:Q3BMP2ZMnew · submitted 2026-05-18 · 💻 cs.AI · cs.HC

Pith reviewed 2026-05-20 09:57 UTC · model grok-4.3

keywords: trust calibration, preference learning, Gaussian processes, Bayesian optimization, agentic AI, human feedback, autonomous agents, risk tolerance

The pith

Trust calibration for agentic tool use is formalized as an instance of preferential Bayesian optimization

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes deciding when an AI agent's proposed action can execute autonomously versus requiring human approval as a preference learning problem. It models human binary approve or deny feedback as observations from a probit likelihood on a latent Gaussian process representing risk tolerance. A policy gateway then escalates to the human precisely where the approval outcome is most uncertain according to the posterior. This setup inherits the inference and sample-efficiency machinery of preferential Bayesian optimization but applies it to partitioning an action space into allow, block, and ask regions rather than optimizing a single design.

Core claim

Trust calibration for agentic tool use is structurally an instance of Preferential Bayesian Optimization. The approach maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve or deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. This differs from standard preferential optimization by classifying the action space into allow, block, or ask regions instead of optimizing a design point.
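
For concreteness, the model described above can be written out as follows. This is a sketch in standard GP-classification notation, not the paper's own equations: the symbols $x$ (a feature encoding of a proposed action), $y$, $\mathcal{D}$ (the feedback collected so far), $\mu$, and $\sigma^2$ are assumed here for illustration.

```latex
\begin{align}
f &\sim \mathcal{GP}\big(\mu_0,\; k(x, x')\big)
  && \text{(latent risk tolerance)} \\
p\big(y = 1 \mid f(x)\big) &= \Phi\big(f(x)\big), \quad y \in \{0, 1\}
  && \text{(probit likelihood; } y = 1 \text{ means approve)} \\
p\big(y_* = 1 \mid \mathcal{D}, x_*\big) &\approx
  \Phi\!\left(\frac{\mu(x_*)}{\sqrt{1 + \sigma^2(x_*)}}\right)
  && \text{(predictive approval probability)}
\end{align}
```

Here $\mu(x_*)$ and $\sigma^2(x_*)$ denote the mean and variance of a Gaussian approximation to the posterior over $f(x_*)$. The last line is the usual closed form for a probit link averaged over a Gaussian latent; it is approximate only because the posterior itself is approximated.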

What carries the argument

The policy gateway that maintains a Gaussian-process posterior over a latent human risk-tolerance function, updated via probit likelihood from binary feedback and using uncertainty-targeted querying to decide escalations.

If this is right

The method inherits approximate Gaussian-process classification machinery for inference.
Sample efficiency follows from uncertainty-targeted querying rather than random or exhaustive feedback.
The action space is partitioned into allow, block, and ask regions based on the risk-tolerance posterior.
The framework provides a principled way to increase autonomy while controlling risk through targeted human oversight.
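
The partition into allow, block, and ask regions can be illustrated with a minimal sketch. It assumes the gateway already has a Gaussian posterior mean and variance for the latent function at a proposed action. The function names and the thresholds `allow_above` and `block_below` are hypothetical choices made for illustration, not values taken from the paper.

```python
import math


def approval_probability(mean, var):
    """Probit predictive for a Gaussian latent: Phi(mean / sqrt(1 + var))."""
    z = mean / math.sqrt(1.0 + var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gateway_decision(mean, var, allow_above=0.95, block_below=0.05):
    """Map a posterior (mean, var) to 'allow', 'block', or 'ask'.

    The two thresholds are illustrative assumptions, not the paper's values.
    """
    p = approval_probability(mean, var)
    if p >= allow_above:
        return "allow"   # approval is confidently predicted
    if p <= block_below:
        return "block"   # denial is confidently predicted
    return "ask"         # outcome is uncertain, so escalate to the human


# Same posterior mean, different variance: high uncertainty forces escalation.
print(gateway_decision(3.0, 0.1))    # confident approval
print(gateway_decision(3.0, 10.0))   # same mean, but much less certain
```

In this sketch the ask band shrinks as posterior variance falls, which is one way to read the claim that autonomy grows as feedback accumulates.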

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same uncertainty-driven escalation could be tested in adjacent domains such as automated content moderation or clinical decision support where binary human overrides are available.
Extending the model to handle delayed or contextual feedback would require only changes to the observation model while preserving the preferential Bayesian optimization structure.
Real-world deployment could measure reduction in human interventions compared to fixed-threshold rules to quantify practical gains.

Load-bearing premise

Binary human approve or deny feedback can be adequately modeled as observations from a probit likelihood on a latent Gaussian-process risk-tolerance function.

What would settle it

An experiment that collects sequences of human approve/deny responses to agent actions and checks whether the observed uncertainty patterns match the predictions of the Gaussian-process posterior; systematic mismatch between predicted and actual escalation rates would falsify the model.
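
One way such a check could be scored is a simple calibration table: bin the model's predicted approval probabilities and compare each bin's average prediction with the observed approval rate. The sketch below is an assumption about how the comparison might be set up, not a procedure from the paper; the function names and the equal-width binning are illustrative.

```python
def calibration_table(predicted, approved, n_bins=5):
    """Bin (predicted probability, observed 0/1 outcome) pairs by prediction.

    Returns one (mean predicted, observed approval rate, count) row per
    non-empty equal-width bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, approved):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        rows.append((mean_p, rate, len(members)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between predicted and observed approval."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(mean_p - rate) for mean_p, rate, n in rows) / total


# Hypothetical feedback log: model predictions and the human's actual answers.
rows = calibration_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(expected_calibration_error(rows))
```

A persistently large gap in the bins near 0.5, where the gateway escalates, would be the kind of systematic mismatch the paragraph above describes.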

Figures

Figures reproduced from arXiv: 2605.19151 by Changkun Ou.

**Figure 1.** Left: the rolling policy mix. The allow share rises as the posterior concentrates (the ask band narrows), then collapses at the t = 750 trust changepoint and recovers, exactly the Section 5 and Section 6 dynamics. Right: correlated generalization (Section 7) to an action-context combination that was never queried. … cumulative trajectory shows a larger gap because the cold-start learn phase escalates heavily…

**Figure 2.** Left: cumulative human queries, gateway versus always-escalate (Section 1). Right: a…

Original abstract

We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames trust calibration for agents as GP-probit preference learning and calls it an instance of preferential Bayesian optimization, but the sample-efficiency transfer to region classification is not shown.

This paper recasts deciding when an AI agent can act autonomously as a preference-learning task. It maintains a Gaussian-process posterior over a latent human risk-tolerance function, updates it with probit likelihood from binary approve/deny feedback, and escalates to the human where the posterior is most uncertain. The central move is to treat this as structurally equivalent to preferential Bayesian optimization while changing the target from locating an optimum to partitioning the action space into allow, block, and ask regions.
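
Escalating where the posterior is most uncertain can be sketched as ranking candidate actions by the entropy of their predicted approval outcome, which peaks when the approval probability is 0.5. The action names and posterior values below are hypothetical, and predictive entropy is only one possible uncertainty score; the paper may use a different criterion.

```python
import math


def approval_probability(mean, var):
    """Probit predictive for a Gaussian latent: Phi(mean / sqrt(1 + var))."""
    z = mean / math.sqrt(1.0 + var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome; 0 at p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def most_uncertain(candidates):
    """Return the candidate whose approve/deny outcome is hardest to predict.

    `candidates` maps an action name to its posterior (mean, var).
    """
    return max(
        candidates,
        key=lambda name: binary_entropy(approval_probability(*candidates[name])),
    )


# Hypothetical posterior summaries for three proposed tool calls.
candidates = {
    "read_file": (2.5, 0.2),     # very likely approved
    "send_email": (0.1, 0.5),    # close to a coin flip
    "delete_repo": (-3.0, 0.3),  # very likely denied
}
print(most_uncertain(candidates))
```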

Referee Report

2 major / 2 minor

Summary. The manuscript formalizes trust calibration for agentic tool use as a preference learning problem. It proposes a policy gateway that maintains a Gaussian process posterior over a latent human risk-tolerance function, using a probit likelihood for binary approve/deny feedback. The system escalates to human review where approval is most uncertain. The central claim is that this setup is structurally an instance of Preferential Bayesian Optimization (PBO), thereby inheriting its approximate GP classification inference and uncertainty-targeted querying for sample efficiency, while the objective is to classify the action space into allow, block, and ask regions rather than optimizing a design point.

Significance. If the formalization is sound and the sample-efficiency argument transfers to the region-classification task, the work could provide a theoretically grounded method for calibrating trust in autonomous agents. This has implications for developing safer and more efficient human-AI collaborative systems, reducing human cognitive load while preserving oversight in uncertain cases. The connection to the PBO literature offers a pathway to leverage existing tools in a new application domain.

major comments (2)

Abstract: The claim that the method inherits PBO's sample-efficiency argument via uncertainty-targeted querying does not include a supporting derivation or bound. The paper notes the objective shift to classifying action space regions, but standard PBO arguments target optimum location; it is unclear whether uncertainty sampling suffices for an accurate posterior over boundaries and interiors under the same query budget.
The manuscript provides no full derivation of the inference machinery or empirical validation, despite describing the GP posterior and probit likelihood. This makes it difficult to assess the correctness of the approximate Gaussian-process classification for the trust calibration gateway.

minor comments (2)

Consider adding a figure or pseudocode illustrating the policy gateway decision process and how the GP posterior is updated.
Ensure consistent notation for the latent function and the regions (allow/block/ask) throughout the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on the manuscript. The feedback highlights key areas where the connection to Preferential Bayesian Optimization requires further clarification. We address each major comment point by point below, indicating planned revisions to strengthen the formalization.

Referee: Abstract: The claim that the method inherits PBO's sample-efficiency argument via uncertainty-targeted querying does not include a supporting derivation or bound. The paper notes the objective shift to classifying action space regions, but standard PBO arguments target optimum location; it is unclear whether uncertainty sampling suffices for an accurate posterior over boundaries and interiors under the same query budget.

Authors: We agree that the manuscript does not supply an explicit derivation or bound demonstrating that uncertainty-targeted querying preserves sample efficiency under the shifted objective of region classification. The current text relies on the structural equivalence to PBO without detailing the transfer. In the revised version we will insert a clarifying discussion (likely in the main text or a dedicated subsection) explaining that uncertainty sampling remains beneficial because accurate delineation of allow/block/ask regions depends on resolving posterior uncertainty near the latent decision boundaries; this is a direct consequence of the same querying strategy used in PBO. A full new theoretical bound tailored to classification is beyond the scope of the present formalization and is left for future work. revision: partial
Referee: The manuscript provides no full derivation of the inference machinery or empirical validation, despite describing the GP posterior and probit likelihood. This makes it difficult to assess the correctness of the approximate Gaussian-process classification for the trust calibration gateway.

Authors: The inference follows the standard approximate Gaussian-process classification procedure with probit likelihood that is already established in the PBO literature; we therefore described the posterior and likelihood without repeating the full derivation for brevity. To address the concern, the revised manuscript will include a short appendix that sketches the key steps of the approximate inference (e.g., Laplace or expectation-propagation updates) so that readers can verify correctness. Regarding empirical validation, the manuscript is a formalization paper whose primary contribution is the structural mapping to PBO; systematic experiments in agentic tool-use settings are important but lie outside the current scope and will be noted as future work. revision: yes
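
As a reading aid for the inference steps mentioned in this response, the Laplace variant of approximate GP classification can be summarized as below. This follows the standard textbook treatment (Rasmussen and Williams, Gaussian Processes for Machine Learning, Chapter 3) and assumes a zero prior mean for brevity; it is a sketch of the generic procedure, not the derivation the authors promise for their appendix. The symbols $K$, $W$, $\hat{\mathbf{f}}$, and $\mathbf{k}_*$ are notation assumed here.

```latex
\begin{align}
\Psi(\mathbf{f}) &= \log p(\mathbf{y} \mid \mathbf{f})
  - \tfrac{1}{2}\,\mathbf{f}^\top K^{-1} \mathbf{f}
  - \tfrac{1}{2}\log\lvert K\rvert + \text{const}
  && \text{(unnormalized log posterior)} \\
W &= -\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f})
  && \text{(diagonal for a factorizing likelihood)} \\
\mathbf{f}^{\text{new}} &= \big(K^{-1} + W\big)^{-1}
  \big(W \mathbf{f} + \nabla \log p(\mathbf{y} \mid \mathbf{f})\big)
  && \text{(Newton step toward the mode } \hat{\mathbf{f}}\text{)} \\
q(\mathbf{f} \mid \mathcal{D}) &= \mathcal{N}\big(\hat{\mathbf{f}},\; (K^{-1} + W)^{-1}\big)
  && \text{(Gaussian approximation at the mode)} \\
\mu(x_*) &= \mathbf{k}_*^\top \nabla \log p(\mathbf{y} \mid \hat{\mathbf{f}}),
  \qquad
  \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top \big(K + W^{-1}\big)^{-1} \mathbf{k}_*
  && \text{(latent predictive moments)}
\end{align}
```

Here $K$ is the kernel matrix over the queried actions, $\mathbf{k}_*$ is the vector of kernel values between those actions and a new action $x_*$, and $W$ in the last line is evaluated at $\hat{\mathbf{f}}$. Expectation propagation, the other option the authors name, replaces the mode-based approximation with moment matching and is not shown.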

Circularity Check

0 steps flagged

No significant circularity; formalization maps to external PBO framework without self-referential reduction

The paper formalizes trust calibration as a preference-learning problem using a Gaussian-process posterior over a latent risk-tolerance function with probit likelihood, then states that this setup is structurally an instance of Preferential Bayesian Optimization. This allows inheritance of approximate GP classification inference and uncertainty-targeted querying, while explicitly differing in objective from design optimization to action-space region classification. No equations reduce a claimed prediction or first-principles result to a fitted parameter or self-defined quantity by construction, and no load-bearing self-citation chain is invoked to justify the core mapping. The derivation remains self-contained as an explicit structural equivalence to prior external machinery rather than an internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on standard assumptions of Gaussian-process regression and probit classification for modeling latent preferences; no new entities are introduced and no free parameters are explicitly fitted in the provided abstract.

axioms (2)

domain assumption: Human approve/deny decisions are generated from a probit likelihood on a latent risk-tolerance function. Invoked to turn binary feedback into observations for the GP posterior.
standard math: The latent function can be represented by a Gaussian process. Used to maintain a posterior over risk tolerance.
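
The Gaussian-process axiom is what lets feedback on one action inform others: the kernel decides how strongly two actions' latent values are correlated. The sketch below uses a squared-exponential kernel over a two-dimensional feature vector. Both choices are assumptions for illustration, since the material above does not say which kernel or action encoding the paper uses, and the feature values are made up.

```python
import math


def rbf_kernel(x, x_prime, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel between two action feature vectors.

    The kernel family and default hyperparameters are illustrative
    assumptions, not taken from the paper.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return variance * math.exp(-0.5 * sq_dist / lengthscale ** 2)


# Hypothetical encodings: (tool risk score, context sensitivity).
queried = (0.20, 0.10)       # an action the human has already judged
near_unseen = (0.25, 0.10)   # similar action that was never queried
far_unseen = (0.90, 0.80)    # very different action

print(rbf_kernel(queried, near_unseen))  # close to 1: feedback transfers
print(rbf_kernel(queried, far_unseen))   # close to 0: little transfer
```

Under this assumption, an approval at the queried action raises the posterior mean at the nearby unseen action far more than at the distant one.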

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Place a GP prior over f: f ∼ GP(μ₀, k(x, x′))"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

