Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

arxiv: 2004.07213 · v2 · pith:JESFYQN6new · submitted 2020-04-15 · 💻 cs.CY

Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang

Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask, Emma Bluemke, Jonathan Lebensold, Cullen O'Keefe, Mark Koren, Théo Ryffel, JB Rubinovitz, Tamay Besiroglu, Federica Carugati, Jack Clark, Peter Eckersley, Sarah de Haas, Maritza Johnson, Ben Laurie, Alex Ingerman, Igor Krawczuk, Amanda Askell, Rosario Cammarota, Andrew Lohn, David Krueger, Charlotte Stix, Peter Henderson, Logan Graham, Carina Prunkl, Bianca Martin, Elizabeth Seger, Noa Zilberman, Seán Ó hÉigeartaigh, Frens Kroeger, Girish Sastry, Rebecca Kagan, Adrian Weller, Brian Tse, Elizabeth Barnes, Allan Dafoe, Paul Scharre, Ariel Herbert-Voss, Martijn Rasser, Shagun Sodhani, Carrick Flynn, Thomas Krendl Gilbert, Lisa Dyer, Saif Khan, Yoshua Bengio, Markus Anderljung

classification 💻 cs.CY

keywords claims, development, mechanisms, systems, they, make, need, stakeholders

With the recent wave of progress in artificial intelligence (AI) has come a growing awareness of the large-scale impacts of AI systems, and recognition that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development. In order for AI developers to earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly, they will need to make verifiable claims to which they can be held accountable. Those outside of a given organization also need effective means of scrutinizing such claims. This report suggests various steps that different stakeholders can take to improve the verifiability of claims made about AI systems and their associated development processes, with a focus on providing evidence about the safety, security, fairness, and privacy protection of AI systems. We analyze ten mechanisms for this purpose--spanning institutions, software, and hardware--and make recommendations aimed at implementing, exploring, or improving those mechanisms.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM Agents Make Collective Belief Dynamics Programmable: Challenges and Research Directions
cs.MA 2026-05 unverdicted novelty 6.0

LLM agents make collective belief dynamics programmable, with simulations showing coordinated agents induce stable belief shifts, and four structural properties that complicate detection and defense.
Ethics Testing: Proactive Identification of Generative AI System Harms
cs.SE 2026-04 unverdicted novelty 6.0

Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument
cs.CY 2026-03 unverdicted novelty 6.0

Limited legal personhood for AI, implemented via purpose-bound operating companies within human-controlled holding structures, serves as a precautionary governance instrument that enables transparency and accountabili...
"Show Me You Comply... Without Showing Me Anything": Zero-Knowledge Software Auditing for AI-Enabled Systems
cs.SE 2025-10 unverdicted novelty 6.0

ZKMLOps is an MLOps framework that uses zero-knowledge proofs to generate verifiable cryptographic evidence of AI model compliance without revealing confidential information.
Output-Constrained Decision Trees
cs.LG 2024-05 unverdicted novelty 6.0

Presents three new training procedures for regression trees that enforce convex output constraints at training time and validates them on synthetic and hierarchical time-series data.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
cs.CL 2022-08 accept novelty 6.0

RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
The AI Evaluability Gap: The Missing Layer for Managing Risk and Sustaining Value
cs.AI 2026-06 unverdicted novelty 5.0

Introduces the AI Evaluability Gap and Evaluability framework to address missing evidentiary foundations in AI risk and value governance decisions.
CoT-Guard: Small Models for Strong Monitoring
cs.CR 2026-05 unverdicted novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms
cs.HC 2026-04 conditional novelty 5.0

A new toolkit with cards and maps enables AI designers to juxtapose values and harms in early concept stages, shown valuable in designer surveys and interviews.
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents
cs.AI 2026-04 unverdicted novelty 5.0

Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes...
What Should Frontier AI Developers Disclose About Internal Deployments?
cs.CY 2026-04 unverdicted novelty 5.0

A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.
Assessing High-Risk AI Systems under the EU AI Act: From Legal Requirements to Technical Verification
cs.CY 2025-12 unverdicted novelty 5.0

A structured mapping translates EU AI Act requirements into implementable verification activities for high-risk AI systems.
MalGEN: A Testbed for Modeling and Evaluating Malware Behaviors
cs.CR 2025-06 unverdicted novelty 5.0

MalGEN generates 977 executable malware samples across 1920 settings, with 45.71% evading existing detection engines and exposing gaps in current defenses.
What Should Frontier AI Developers Disclose About Internal Deployments?
cs.CY 2026-04 unverdicted novelty 4.0

A four-category disclosure framework for internal frontier AI deployments, covering capabilities, usage, safety mitigations, and governance.
AI Identification: An Integrated Framework for Sustainable Governance in Digital Enterprises
cs.CR 2026-04 unverdicted novelty 4.0

The paper introduces a dual-layer AI identification framework that integrates cryptographic, blockchain, and zero-knowledge techniques with governance checkpoints to support lifecycle accountability in digital enterprises.
Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
cs.AI 2025-10 unverdicted novelty 3.0

A scoping review of AIES and FAccT literature concludes that AI trustworthiness research prioritizes technical precision over social, ethical, and institutional factors, leaving the sociotechnical nature of AI systems...
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
cs.CL 2025-03 accept novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding
cs.AI 2026-04 unverdicted novelty 2.0

Squirrel behaviors supply a comparative template for a hierarchical control model that integrates latent dynamics, episodic memory, observer beliefs, and delayed verification in agentic AI.
Automation and AI Technology in Surface Mining With a Brief Introduction to Open-Pit Operations in the Pilbara
cs.CY 2023-01 unverdicted novelty 1.0

The paper surveys open-pit mining processes in the Pilbara and highlights AI/automation challenges and opportunities across nine steps from geological assessment to ore shipment.