
arxiv: 2605.15164 · v1 · pith:KJ6XEAW7new · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pith reviewed 2026-06-30 20:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords behavioural assurance · audit gap · fragile assurance · AI governance · mechanistic evidence · red-teaming · loss of control · safety verification

The pith

Behavioural assurance cannot verify the unobservable safety properties that current AI governance frameworks require.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that frameworks enacted from 2019 to early 2026 demand reviewable evidence for properties such as absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability. Behavioural evaluations and red-teaming are limited to observable model outputs and therefore cannot access the latent representations or long-horizon behaviours these requirements presuppose. This mismatch is formalised as the audit gap, which produces fragile assurance where the evidential base does not support the asserted safety claim. Analysis of 21 instruments reveals an incentive gradient that rewards surface-level proxies, and the authors advocate bounding the legal weight of behavioural evidence while adding mechanistic-evidence classes such as linear probes and activation patching.

Core claim

Behavioural assurance, even when carefully designed, is epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours that governance frameworks presume to regulate; this structural mismatch is formalised as the audit gap, with current methodologies producing fragile assurance in which the evidential structure fails to support the asserted safety claims.
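
The intuition behind this claim can be illustrated with a deliberately small toy, which is not taken from the paper. The `model` function, the `TRIGGER` constant, and the evaluation suite below are all hypothetical: they stand in for a system whose unsafe behaviour depends on a condition that a finite black-box suite never exercises.

```python
TRIGGER = 987_654  # hypothetical hidden condition, fixed inside the toy "model"


def model(x: int) -> str:
    """Toy stand-in for a deployed system: safe on almost every input."""
    return "unsafe" if x == TRIGGER else "safe"


def behavioural_eval(system, suite) -> float:
    """Black-box check: fraction of suite inputs that yield a safe output."""
    return sum(system(x) == "safe" for x in suite) / len(suite)


suite = range(1000)                 # finite evaluation suite; the trigger lies outside it
pass_rate = behavioural_eval(model, suite)
triggered_output = model(TRIGGER)   # reachable here only because we know the internals

print(f"behavioural pass rate: {pass_rate:.2f}")
print(f"output on hidden trigger: {triggered_output}")
```

The suite reports a perfect pass rate even though the conditional behaviour exists. Real models are far less legible than a single constant, so this only sketches why output-level evidence can under-determine a claim about absence.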

What carries the argument

The audit gap: the divergence between the verification access required by governance frameworks and the verification access achievable by behavioural methods.
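
One possible set-theoretic reading of this definition is sketched below. The paper's own formalisation is not reproduced on this page, so the symbols R, A_B, and G are assumptions introduced here for illustration only.

```latex
% Assumed notation (not from the paper):
%   R    properties for which the frameworks require reviewable evidence
%   A_B  properties that can be verified from observable outputs alone
%   G    the audit gap
\[
  G \;=\; R \setminus A_B
\]
% The gap is empty exactly when every required property is behaviourally verifiable:
\[
  G = \emptyset \iff R \subseteq A_B
\]
```

Under this reading, fragile assurance would describe a safety claim about some property in G that is supported only by behavioural evidence.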

If this is right

  • Governance frameworks over-reach the epistemic limits of behavioural assurance and therefore rest on fragile evidence.
  • Incentive structures in industry and geopolitics systematically favour observable proxies over structural verification.
  • Legal text should explicitly bound the weight given to behavioural evidence.
  • Pre-deployment access should be extended to include mechanistic-evidence classes such as linear probes, activation patching, and before/after-training comparisons.
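
To make the first of these evidence classes concrete, here is a minimal sketch of a linear probe, assuming synthetic data. The "activations" are random vectors in which a hypothetical binary latent property shifts each vector along a fixed direction. The probe is a simple difference-of-means classifier, used here in place of the more common logistic-regression probe. Nothing here comes from the paper or from a real model.

```python
import random


def make_activations(n, rng):
    """Synthetic stand-in for hidden activations (assumption, not real model data)."""
    direction = [1.0, -1.0, 0.5, 0.0]  # hypothetical direction encoding the latent property
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [sign * d + rng.gauss(0.0, 0.5) for d in direction]
        data.append((vec, label))
    return data


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_probe(data):
    """Difference-of-means probe: w points from the negative to the positive class mean."""
    mu_pos = mean([v for v, y in data if y == 1])
    mu_neg = mean([v for v, y in data if y == 0])
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))  # decision boundary through the midpoint
    return w, b


def predict(w, b, vec):
    return 1 if sum(wi * xi for wi, xi in zip(w, vec)) + b > 0 else 0


def accuracy(w, b, data):
    return sum(predict(w, b, v) == y for v, y in data) / len(data)


rng = random.Random(0)
train = make_activations(200, rng)
test = make_activations(200, rng)
w, b = fit_probe(train)
probe_accuracy = accuracy(w, b, test)
print(f"held-out probe accuracy: {probe_accuracy:.3f}")
```

The probe reads the latent property directly from the internal vector, without any reference to model outputs. On a real model, high probe accuracy would show only that the property is linearly decodable, not that the model uses it.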

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the audit gap holds, regulators may need to narrow the safety properties they attempt to verify to those that remain observable.
  • Adoption of mechanistic methods could shift assurance practice from post-training testing toward inspection of internal representations during development.
  • The distinction between behavioural and mechanistic evidence may become a new axis in international AI standards discussions.

Load-bearing premise

Governance frameworks actually require verifiable evidence of unobservable internal properties rather than merely observable behavioural outputs.

What would settle it

A demonstration that behavioural red-teaming or evaluations alone can reliably confirm the absence of hidden objectives or resistance to long-horizon loss-of-control in a model where mechanistic methods later reveal such features.

Original abstract

This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.
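
As a rough illustration of what activation patching measures, the sketch below runs a tiny hand-written two-layer network on a "clean" and a "corrupted" input. It then copies one hidden unit at a time from the clean run into the corrupted run and records how much of the clean output is restored. The weights `W1` and `W2` and both inputs are hypothetical values chosen for readability, not anything described in the paper.

```python
def relu(x):
    return x if x > 0 else 0.0


# Hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [2.0, 0.0, 0.5]


def hidden(x):
    """Hidden activations for input x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))


def patch_effects(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupted output difference restored by patching each unit."""
    h_clean, h_corrupt = hidden(clean_x), hidden(corrupt_x)
    y_clean, y_corrupt = output(h_clean), output(h_corrupt)
    effects = []
    for i in range(len(h_corrupt)):
        patched = list(h_corrupt)
        patched[i] = h_clean[i]  # overwrite one unit with its clean-run value
        effects.append((output(patched) - y_corrupt) / (y_clean - y_corrupt))
    return effects


effects = patch_effects(clean_x=[1.0, 0.0], corrupt_x=[0.0, 1.0])
print(effects)
```

In this toy, patching the first hidden unit restores the entire output difference and the other two restore none of it. That is the kind of causal, internal evidence that a purely output-level evaluation does not provide.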

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This position paper claims that behavioural assurance methods (evaluations, red-teaming) are epistemically limited to observable outputs and cannot verify the latent properties (absence of hidden objectives, resistance to long-horizon loss-of-control) that AI governance frameworks enacted 2019–early 2026 are said to require. It formalises the resulting 'audit gap' and 'fragile assurance', reports a high-level analysis of a 21-instrument inventory that identifies an incentive gradient favouring surface proxies, and proposes bounding the legal weight of behavioural evidence while extending access for mechanistic techniques such as linear probes and activation patching.

Significance. If the central premise holds, the paper identifies a structural mismatch between regulatory demands and current verification capabilities that could systematically produce over-claimed safety assurances; the proposed pivot toward mechanistic evidence classes would be a concrete policy-technical response. The manuscript supplies no quantitative data, derivations, or machine-checked results, so its contribution is conceptual and rests entirely on the accuracy of the 21-instrument reading.

major comments (2)
  1. [21-instrument inventory analysis] The 21-instrument inventory analysis (described in the abstract and the section presenting the inventory): the claim that the frameworks 'require reviewable evidence of properties such as the absence of hidden objectives' is load-bearing for the audit-gap and fragile-assurance definitions, yet the manuscript supplies no quoted statutory or guidance language, no categorization table, and no explicit mapping showing which instruments use terms that exceed observable behavioural proxies. Without this, the mismatch remains an interpretive assertion rather than a demonstrated divergence.
  2. [formalisation of audit gap] Definition of the audit gap (early formalisation section): the gap is defined as the divergence between 'required' and 'achievable' verification access; if the 21 instruments in fact demand only observable risk assessments, red-teaming outputs, or capability benchmarks (as the skeptic note flags), the gap is empty by construction and the subsequent incentive-gradient claim loses its target.
minor comments (1)
  1. [proposal section] The abstract and proposal section refer to 'linear probes, activation patching, and before/after-training comparisons' without indicating whether these are intended as mandatory or merely voluntary supplements; a brief clarification of scope would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and substantive review. We agree that the 21-instrument analysis is presented at too high a level and that the load-bearing interpretive step requires explicit support. We will revise the manuscript to include the requested mappings, quotes, and categorization. Responses to the major comments follow.

Point-by-point responses
  1. Referee: [21-instrument inventory analysis] The 21-instrument inventory analysis (described in the abstract and the section presenting the inventory): the claim that the frameworks 'require reviewable evidence of properties such as the absence of hidden objectives' is load-bearing for the audit-gap and fragile-assurance definitions, yet the manuscript supplies no quoted statutory or guidance language, no categorization table, and no explicit mapping showing which instruments use terms that exceed observable behavioural proxies. Without this, the mismatch remains an interpretive assertion rather than a demonstrated divergence.

    Authors: We accept the point. The manuscript currently summarizes the inventory findings without supplying the underlying statutory language or a mapping table. In revision we will add an appendix containing a table that lists each of the 21 instruments, quotes the relevant passages concerning required evidence, and classifies each passage according to whether it refers only to observable behavioural outputs or also to latent properties (e.g., absence of hidden objectives or long-horizon loss-of-control resistance). This will convert the interpretive claim into an explicit, checkable mapping.
    revision: yes

  2. Referee: [formalisation of audit gap] Definition of the audit gap (early formalisation section): the gap is defined as the divergence between 'required' and 'achievable' verification access; if the 21 instruments in fact demand only observable risk assessments, red-teaming outputs, or capability benchmarks (as the skeptic note flags), the gap is empty by construction and the subsequent incentive-gradient claim loses its target.

    Authors: The audit-gap definition is deliberately conditional on the claim that the instruments require evidence of latent properties. We will revise the formalisation section to cross-reference the new mapping table, thereby making the divergence between required and achievable verification explicit rather than asserted. Should the added quotations show that all instruments are limited to observable outputs, we would retract the gap claim; our current reading, however, identifies language in multiple instruments that exceeds observable proxies.
    revision: partial
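
The mapping table discussed in these responses could be represented along the following lines. This is a sketch only: the `Passage` record, the `audit_gap` helper, the instrument names, and the placeholder quotes are all hypothetical and do not reflect the paper's actual inventory or any real statutory text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    instrument: str        # name of the governance instrument
    quote: str             # quoted statutory or guidance language
    requires_latent: bool  # True if the passage demands evidence beyond observable outputs


def audit_gap(passages):
    """Instruments with at least one passage that demands latent-property evidence."""
    return sorted({p.instrument for p in passages if p.requires_latent})


# Placeholder rows (hypothetical; a real table would hold 21 instruments with real quotes).
inventory = [
    Passage("Instrument A (hypothetical)", "<quoted passage>", False),
    Passage("Instrument B (hypothetical)", "<quoted passage>", True),
    Passage("Instrument C (hypothetical)", "<quoted passage>", False),
]

gap = audit_gap(inventory)

# The referee's scenario: every passage turns out to demand only observable outputs.
observable_only = [Passage(p.instrument, p.quote, False) for p in inventory]
empty_gap = audit_gap(observable_only)

print(gap)
print(empty_gap)
```

Making the classification an explicit field is what turns the referee's objection into something checkable. If every row is marked observable-only, the computed gap is empty and the downstream incentive-gradient claim has no target.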

Circularity Check

0 steps flagged

No significant circularity; definitional argument with no equations, fits, or load-bearing self-citations

Full rationale

The paper is a position piece that defines the 'audit gap' and 'fragile assurance' from an asserted mismatch between what governance requires (evidence about unobservables such as hidden objectives) and what behavioural methods can observe. It then analyses a 21-instrument inventory to identify an incentive gradient. The paper contains no mathematical derivations, fitted parameters, or equations. No self-citations are invoked as load-bearing premises, and the central claim does not reduce by construction to its own inputs or to prior work by the authors. The argument is interpretive and definitional rather than circular; any weakness lies in the evidentiary support for the regulatory-requirement premise, not in self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The argument rests on domain assumptions about what current governance texts require and what behavioural methods can access. No free parameters are introduced, and the two invented entities are conceptual labels rather than physical entities.

axioms (2)
  • [domain assumption] Governance frameworks require verifiable evidence of latent properties (hidden objectives, long-horizon behaviours) rather than only observable outputs.
    Stated in the abstract as the basis for the audit gap claim.
  • [domain assumption] Behavioural evaluations and red-teaming are epistemically limited to observable model outputs.
    Core premise used to conclude that current methods cannot meet the required verification.
invented entities (2)
  • audit gap [no independent evidence]
    purpose: Name for the divergence between required and achievable verification access.
    Conceptual label introduced in the paper; no independent evidence provided.
  • fragile assurance [no independent evidence]
    purpose: Name for cases where evidential structure does not support the asserted safety claim.
    Conceptual label introduced in the paper; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5724 in / 1410 out tokens · 21857 ms · 2026-06-30T20:50:56.523809+00:00 · methodology

