Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pratinav Seth; Vinay Kumar Sankarapu

arxiv: 2605.15164 · v1 · pith:KJ6XEAW7new · submitted 2026-05-14 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-30 20:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: behavioural assurance · audit gap · fragile assurance · AI governance · mechanistic evidence · red-teaming · loss of control · safety verification


The pith

Behavioural assurance cannot verify the unobservable safety properties that current AI governance frameworks require.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that frameworks enacted from 2019 to early 2026 demand reviewable evidence for properties such as absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability. Behavioural evaluations and red-teaming are limited to observable model outputs and therefore cannot access the latent representations or long-horizon behaviours these requirements presuppose. This mismatch is formalised as the audit gap, which produces fragile assurance where the evidential base does not support the asserted safety claim. Analysis of 21 instruments reveals an incentive gradient that rewards surface-level proxies, and the authors advocate bounding the legal weight of behavioural evidence while adding mechanistic-evidence classes such as linear probes and activation patching.

Core claim

Behavioural assurance, even when carefully designed, is epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours that governance frameworks presume to regulate; this structural mismatch is formalised as the audit gap, with current methodologies producing fragile assurance in which the evidential structure fails to support the asserted safety claims.

What carries the argument

The audit gap: the divergence between the verification access required by governance frameworks and the verification access achievable by behavioural methods.

If this is right

Governance frameworks over-reach the epistemic limits of behavioural assurance and therefore rest on fragile evidence.
Incentive structures in industry and geopolitics systematically favour observable proxies over structural verification.
Legal text should explicitly bound the weight given to behavioural evidence.
Pre-deployment access should be extended to include mechanistic-evidence classes such as linear probes, activation patching, and before/after-training comparisons.
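As a rough illustration of the first mechanistic-evidence class named above, the following sketch trains a linear probe on synthetic "activations". Everything in it is an assumption made for illustration: the data are random vectors rather than real model hidden states, the latent label is planted along one coordinate, and the function names (`train_linear_probe`, `probe_accuracy`, `make_synthetic_activations`) are hypothetical and do not come from the paper.

```python
import math
import random


def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe by batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, activations, labels):
    """Fraction of examples where the probe's sign matches the label."""
    correct = 0
    for x, y in zip(activations, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


def make_synthetic_activations(n, dim, seed):
    """Hypothetical stand-in for hidden states: the latent label shifts coordinate 0."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 3.0 if y == 1 else -3.0
        xs.append(x)
        ys.append(y)
    return xs, ys


train_x, train_y = make_synthetic_activations(200, 8, seed=0)
test_x, test_y = make_synthetic_activations(100, 8, seed=1)
w, b = train_linear_probe(train_x, train_y)
held_out_accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {held_out_accuracy:.2f}")
```

The point of the toy is only that a probe reads a property from internal state rather than from outputs. High probe accuracy shows that the property is linearly decodable; on its own it does not show that the model uses it.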

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the audit gap holds, regulators may need to narrow the safety properties they attempt to verify to those that remain observable.
Adoption of mechanistic methods could shift assurance practice from post-training testing toward inspection of internal representations during development.
The distinction between behavioural and mechanistic evidence may become a new axis in international AI standards discussions.

Load-bearing premise

Governance frameworks actually require verifiable evidence of unobservable internal properties rather than merely observable behavioural outputs.

What would settle it

A demonstration that behavioural red-teaming or evaluations alone can reliably confirm the absence of hidden objectives or resistance to long-horizon loss-of-control in a model where mechanistic methods later reveal such features.

Original abstract

This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper argues that behavioural tests cannot verify the unobservable safety properties demanded by recent AI governance, but it needs more specific evidence from the 21 instruments to establish that mismatch.


Behavioral assurance cannot verify the safety claims that current governance frameworks are asking for. That's the core position, and it's worth taking seriously because behavioral tests really are limited to what the model outputs.

The paper introduces the 'audit gap' as the mismatch between required verification and what behavioral methods can access, and 'fragile assurance' for when the evidence doesn't support the claim. It reviews 21 instruments and points to incentives that favor easy behavioral proxies. The proposal to bound behavioral evidence in legal text and add mechanistic methods like linear probes is a concrete suggestion.

This framing is new and the inventory is a useful starting point for discussion. It does a good job highlighting the epistemic limits without overclaiming technical results.

The main weakness is that the argument hinges on governance requiring evidence of unobservables like absence of hidden objectives. The abstract states this, but there's no quoted language or detailed categorization showing that the instruments go beyond behavioral proxies. If the requirements are mostly about observable benchmarks and red-teaming, the structural mismatch shrinks. The 21-instrument part is described at high level only, so it's hard to judge how systematic the incentive gradient is.

This paper is aimed at AI governance researchers and policymakers who care about aligning technical assurance with regulatory demands. A reader interested in the intersection of policy and technical limits will get value from the concepts, even if they want more evidence.

It deserves serious referee time because the issue is important and the position is coherent on its own terms. The work engages with real regulatory developments.

I'd recommend sending it for peer review, but ask the authors to strengthen the mapping from specific instruments to the claimed requirements by including more direct quotes or a breakdown table.

Referee Report

2 major / 1 minor

Summary. This position paper claims that behavioural assurance methods (evaluations, red-teaming) are epistemically limited to observable outputs and cannot verify the latent properties (absence of hidden objectives, resistance to long-horizon loss-of-control) that AI governance frameworks enacted 2019–early 2026 are said to require. It formalises the resulting 'audit gap' and 'fragile assurance', reports a high-level analysis of a 21-instrument inventory that identifies an incentive gradient favouring surface proxies, and proposes bounding the legal weight of behavioural evidence while extending access for mechanistic techniques such as linear probes and activation patching.

Significance. If the central premise holds, the paper identifies a structural mismatch between regulatory demands and current verification capabilities that could systematically produce over-claimed safety assurances; the proposed pivot toward mechanistic evidence classes would be a concrete policy-technical response. The manuscript supplies no quantitative data, derivations, or machine-checked results, so its contribution is conceptual and rests entirely on the accuracy of the 21-instrument reading.

major comments (2)

[21-instrument inventory analysis] The 21-instrument inventory analysis (described in the abstract and the section presenting the inventory): the claim that the frameworks 'require reviewable evidence of properties such as the absence of hidden objectives' is load-bearing for the audit-gap and fragile-assurance definitions, yet the manuscript supplies no quoted statutory or guidance language, no categorization table, and no explicit mapping showing which instruments use terms that exceed observable behavioural proxies. Without this, the mismatch remains an interpretive assertion rather than a demonstrated divergence.
[formalisation of audit gap] Definition of the audit gap (early formalisation section): the gap is defined as the divergence between 'required' and 'achievable' verification access; if the 21 instruments in fact demand only observable risk assessments, red-teaming outputs, or capability benchmarks (as the skeptic note flags), the gap is empty by construction and the subsequent incentive-gradient claim loses its target.

minor comments (1)

[proposal section] The abstract and proposal section refer to 'linear probes, activation patching, and before/after-training comparisons' without indicating whether these are intended as mandatory or merely voluntary supplements; a brief clarification of scope would improve readability.
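For readers unfamiliar with activation patching, one of the techniques named in the comment above, the following toy sketch may help. It is not taken from the paper: the three-unit network, its fixed weights (`W_IN`, `W_OUT`), and the function names are all hypothetical, chosen so the arithmetic can be checked by hand.

```python
def relu(v):
    return max(0.0, v)


# Hypothetical network: 2 inputs, 3 hidden units, 1 output, fixed weights.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [2.0, 0.5, 0.0]


def hidden_activations(x):
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def output_from_hidden(h):
    return sum(w * hi for w, hi in zip(W_OUT, h))


def patch_effects(clean_x, corrupted_x):
    """For each hidden unit, copy its clean activation into the corrupted run
    and report the fraction of the clean-vs-corrupted output gap restored."""
    clean_h = hidden_activations(clean_x)
    corrupted_h = hidden_activations(corrupted_x)
    clean_out = output_from_hidden(clean_h)
    corrupted_out = output_from_hidden(corrupted_h)
    gap = clean_out - corrupted_out
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")
    effects = []
    for i in range(len(clean_h)):
        patched = list(corrupted_h)
        patched[i] = clean_h[i]  # the patch: one activation swapped in
        effects.append((output_from_hidden(patched) - corrupted_out) / gap)
    return effects


effects = patch_effects(clean_x=[1.0, 1.0], corrupted_x=[0.0, 1.0])
print(effects)
```

In this toy, hidden unit 2 changes between the clean and corrupted runs, yet patching it restores none of the output gap because its output weight is zero. Patching unit 0 restores all of it. That difference between a unit that merely varies and a unit that causally matters is the kind of evidence patching is meant to supply.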

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and substantive review. We agree that the 21-instrument analysis is presented at too high a level and that the load-bearing interpretive step requires explicit support. We will revise the manuscript to include the requested mappings, quotes, and categorization. Responses to the major comments follow.


Referee: [21-instrument inventory analysis] The 21-instrument inventory analysis (described in the abstract and the section presenting the inventory): the claim that the frameworks 'require reviewable evidence of properties such as the absence of hidden objectives' is load-bearing for the audit-gap and fragile-assurance definitions, yet the manuscript supplies no quoted statutory or guidance language, no categorization table, and no explicit mapping showing which instruments use terms that exceed observable behavioural proxies. Without this, the mismatch remains an interpretive assertion rather than a demonstrated divergence.

Authors: We accept the point. The manuscript currently summarizes the inventory findings without supplying the underlying statutory language or a mapping table. In revision we will add an appendix containing a table that lists each of the 21 instruments, quotes the relevant passages concerning required evidence, and classifies each passage according to whether it refers only to observable behavioural outputs or also to latent properties (e.g., absence of hidden objectives or long-horizon loss-of-control resistance). This will convert the interpretive claim into an explicit, checkable mapping.
revision: yes

Referee: [formalisation of audit gap] Definition of the audit gap (early formalisation section): the gap is defined as the divergence between 'required' and 'achievable' verification access; if the 21 instruments in fact demand only observable risk assessments, red-teaming outputs, or capability benchmarks (as the skeptic note flags), the gap is empty by construction and the subsequent incentive-gradient claim loses its target.

Authors: The audit-gap definition is deliberately conditional on the claim that the instruments require evidence of latent properties. We will revise the formalisation section to cross-reference the new mapping table, thereby making the divergence between required and achievable verification explicit rather than asserted. Should the added quotations show that all instruments are limited to observable outputs, we would retract the gap claim; our current reading, however, identifies language in multiple instruments that exceeds observable proxies.
revision: partial
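
The mapping table promised in these responses, and the referee's "empty by construction" objection, can be pictured with a small data-structure sketch. This is an illustration under stated assumptions, not the authors' table: `EvidenceScope`, `MappingRow`, and `audit_gap` are hypothetical names, and the two rows are placeholders rather than real instruments or real quotations.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceScope(Enum):
    OBSERVABLE_ONLY = "refers only to observable behavioural outputs"
    LATENT = "also refers to latent properties"


@dataclass(frozen=True)
class MappingRow:
    instrument: str       # name of the governance instrument
    quoted_passage: str   # verbatim language on required evidence
    scope: EvidenceScope  # classification of that passage


def audit_gap(rows):
    """Instruments whose quoted requirement exceeds observable outputs.

    Assumes, as the paper does, that behavioural methods can access
    observable outputs only.
    """
    return [row.instrument for row in rows if row.scope is EvidenceScope.LATENT]


# Placeholder rows only; the actual 21 instruments and quotes are not reproduced here.
rows = [
    MappingRow("Instrument A (placeholder)", "<quoted passage to be supplied>",
               EvidenceScope.OBSERVABLE_ONLY),
    MappingRow("Instrument B (placeholder)", "<quoted passage to be supplied>",
               EvidenceScope.LATENT),
]
gap = audit_gap(rows)

# The referee's scenario: every passage classified as observable-only.
all_observable = [
    MappingRow(r.instrument, r.quoted_passage, EvidenceScope.OBSERVABLE_ONLY)
    for r in rows
]
empty_gap = audit_gap(all_observable)
print(gap, empty_gap)
```

With every row classified as observable-only, `audit_gap` returns an empty list. That is the case in which the authors say they would retract the gap claim.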

Circularity Check

0 steps flagged

No significant circularity; definitional argument with no equations, fits, or load-bearing self-citations


The paper is a position piece that defines the 'audit gap' and 'fragile assurance' from an asserted mismatch between governance requirements (unobservables such as hidden objectives) and behavioral methods' observable limits, then analyzes a 21-instrument inventory to identify an incentive gradient. No mathematical derivations, fitted parameters, or equations exist. No self-citations are invoked as load-bearing premises, and the central claim does not reduce by construction to its own inputs or prior author work. The argument is interpretive and definitional rather than circular; any weakness lies in evidentiary support for the regulatory-requirement premise, not in self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The argument rests on domain assumptions about what current governance texts require and what behavioral methods can access; no free parameters or invented physical entities are introduced.

axioms (2)

domain assumption: Governance frameworks require verifiable evidence of latent properties (hidden objectives, long-horizon behaviours) rather than only observable outputs.
Stated in the abstract as the basis for the audit gap claim.
domain assumption: Behavioural evaluations and red-teaming are epistemically limited to observable model outputs.
Core premise used to conclude that current methods cannot meet the required verification.

invented entities (2)

audit gap · no independent evidence
purpose: Name for the divergence between required and achievable verification access.
Conceptual label introduced in the paper; no independent evidence provided.
fragile assurance · no independent evidence
purpose: Name for cases where evidential structure does not support the asserted safety claim.
Conceptual label introduced in the paper; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5724 in / 1410 out tokens · 21857 ms · 2026-06-30T20:50:56.523809+00:00 · methodology


Reference graph

Works this paper leans on

89 extracted references · 25 canonical work pages · 4 internal anchors

[1]

Regulation (EU) 2024/1689 on harmonised rules on artificial intelligence (AI act), June 2024

European Parliament and Council. Regulation (EU) 2024/1689 on harmonised rules on artificial intelligence (AI act), June 2024

2024
[2]

Senate bill 53: Transparency in frontier artificial intelligence act (TFAIA), September 2025

State of California. Senate bill 53: Transparency in frontier artificial intelligence act (TFAIA), September 2025

2025
[3]

Framework convention on artificial intelligence and human rights, democracy and the rule of law, September 2024

Council of Europe. Framework convention on artificial intelligence and human rights, democracy and the rule of law, September 2024

2024
[4]

Black-box access is insufficient for rigorous AI audits

Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell. Black-box access is insu...

2024
[5]

Model evaluation for extreme risks.arXiv preprint arXiv:2305.15324, 2023

Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokota- jlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. Model evaluation for extreme risks.arXiv pre...

work page arXiv 2023
[6]

Safety cases for frontier AI

Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, and Markus Anderljung. Safety cases for frontier AI. arXiv:2410.21572, 2024

work page arXiv 2024
[7]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield- Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022

2022
[8]

Ritchie, Sören Mindermann, Evan Hubinger, Ethan Perez, and Kevin K

Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Sören Mindermann, Evan Hubinger, Ethan Perez, and Kevin K. Troy. Agentic misalignment: How LLMs could be insider threats.arXiv preprint arXiv:2510.05179, October 2025

work page arXiv 2025
[9]

Year in review 2025, 2025

UK AI Security Institute. Year in review 2025, 2025

2025
[10]

Ziegler, Elizabeth Barnes, and Lawrence Chan

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney V on Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence...

work page arXiv 2025
[11]

Frontier Models are Capable of In-context Scheming

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

McKenzie, Oskar J

Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, and Adam Gleave. STACK: Adversarial attacks on LLM safeguard pipelines.arXiv preprint arXiv:2506.24068, 2025

work page arXiv 2025
[13]

Senate bill 1047: Safe and secure innovation for frontier artificial intelligence models act (vetoed), 2024

State of California. Senate bill 1047: Safe and secure innovation for frontier artificial intelligence models act (vetoed), 2024

2024
[14]

Frontier compliance framework (SB-53), December 2025

Anthropic. Frontier compliance framework (SB-53), December 2025

2025
[15]

From surveillance to signalling: escalation channels as environmental controls for agentic AI

Francesca Gomez. Adapting insider risk mitigations for agentic misalignment: An empirical study.arXiv preprint arXiv:2510.05192, October 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Safety-case research programme

UK AI Security Institute. Safety-case research programme. Technical report, UK AISI, 2025

2025
[17]

A pragmatic vision for interpretability

Neel Nanda et al. A pragmatic vision for interpretability. Alignment Forum, December 2025

2025
[18]

An approach to technical AGI safety and security

Rohin Shah et al. An approach to technical AGI safety and security. Google DeepMind, 2025

2025
[19]

Structured access: An emerging paradigm for safe AI deployment

Toby Shevlane. Structured access: An emerging paradigm for safe AI deployment. Centre for the Governance of AI working paper, 2022

2022
[20]

Kochenderfer, and Robert Trager

Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, Neel Guha,...

2025
[21]

Bucknall and Robert F

Benjamin S. Bucknall and Robert F. Trager. Structured access for third-party research on frontier AI models. GovAI working paper, 2023

2023
[22]

Expanding external access to frontier AI models for dangerous capability evaluations.arXiv preprint arXiv:2601.11916, January 2026

Jacob Charnock, Alejandro Tlaie, Kyle O’Brien, Stephen Casper, and Aidan Homewood. Expanding external access to frontier AI models for dangerous capability evaluations.arXiv preprint arXiv:2601.11916, January 2026. 11 Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

work page arXiv 2026
[23]

Verifying international agreements on AI: Six layers of verification for rules on large-scale AI development and deployment.arXiv preprint arXiv:2507.15916, July 2025

Mauricio Baker, Gabriel Kulp, Oliver Marks, Miles Brundage, and Lennart Heim. Verifying international agreements on AI: Six layers of verification for rules on large-scale AI development and deployment.arXiv preprint arXiv:2507.15916, July 2025

work page arXiv 2025
[24]

Auditing large language models: A three-layered approach.AI and Ethics, 4(4):1085–1115, 2024

Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, and Luciano Floridi. Auditing large language models: A three-layered approach.AI and Ethics, 4(4):1085–1115, 2024

2024
[25]

A structured approach to safety case construction for AI systems.arXiv preprint arXiv:2601.22773, 2026

Sung Une Lee, Liming Zhu, Md Shamsujjoha, Liming Dong, Qinghua Lu, Jieshan Chen, and Lionel Briand. A structured approach to safety case construction for AI systems.arXiv preprint arXiv:2601.22773, 2026

work page arXiv 2026
[26]

Kerry et al

Cameron F. Kerry et al. Is AI sovereignty possible? lessons from semiconductor and cloud policy. Technical report, Brookings Institution, February 2026

2026
[27]

AI compute sovereignty: Infrastructure control across territories, cloud providers, and accelerators.SSRN Working Paper, June 2025

Zoe Jay Hawkins, Vili Lehdonvirta, and Boxi Wu. AI compute sovereignty: Infrastructure control across territories, cloud providers, and accelerators.SSRN Working Paper, June 2025

2025
[28]

Lennart Heim. Understanding the artificial intelligence diffusion framework: Can export controls create a U.S.-led global artificial intelligence ecosystem? RAND Perspective PEA3776-1, RAND Corporation, January 2025

2025
[29]

The sovereign AI agenda: Tech:forward survey

McKinsey & Company. The sovereign AI agenda: Tech:forward survey. Technical report, McKinsey, December 2025

2025
[30]

Carl D. Liggio. The expectation gap: The accountant’s legal Waterloo.Journal of Contemporary Business, 3(3):27–44, 1974

1974
[31]

An empirical study of the audit expectation-performance gap.Accounting and Business Research, 24(93):49–68, 1993

Brenda Porter. An empirical study of the audit expectation-performance gap.Accounting and Business Research, 24(93):49–68, 1993

1993
[32]

The audit expectations gap in Britain: An empirical investigation.Accounting and Business Research, 23(sup1):395–411, 1993

Christopher Humphrey, Peter Moizer, and Stuart Turley. The audit expectations gap in Britain: An empirical investigation.Accounting and Business Research, 23(sup1):395–411, 1993

1993
[33]

Audit expectation gap: Concept, nature and trace.African Journal of Business Management, 5(21):8376–8392, 2011

Mahdi Salehi. Audit expectation gap: Concept, nature and trace.African Journal of Business Management, 5(21):8376–8392, 2011

2011
[34]

UK AISI alignment evaluation case-study.arXiv preprint arXiv:2604.00788, April 2026

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D’Cruz, and Xander Davies. UK AISI alignment evaluation case-study.arXiv preprint arXiv:2604.00788, April 2026

work page arXiv 2026
[35]

Time horizon 1.1

METR. Time horizon 1.1. METR Blog, January 2026

2026
[36]

Stress testing deliberative alignment for anti-scheming training.arXiv preprint arXiv:2509.15541, September 2025

Bronson Schoen, Evgenia Nitishinskaya, Carson Denison, Alon Karpas, Kushal Tirumala, Boaz Wang, Tomek Korbak, Samuel Marks, Stephanie Lin, Catherine Olsson, Sara Riedel, Henry Sleight, Fabien Roger, Marius Hobbhahn, Mikita Balesni, et al. Stress testing deliberative alignment for anti-scheming training.arXiv preprint arXiv:2509.15541, September 2025

work page arXiv 2025
[37]

Detecting and reducing scheming in AI models

OpenAI. Detecting and reducing scheming in AI models. OpenAI Blog (collaboration with Apollo Research), September 2025

2025
[38]

Mechanistic interpretability for AI safety – a review.Transactions on Machine Learning Research (TMLR), 2024

Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for AI safety – a review.Transactions on Machine Learning Research (TMLR), 2024

2024
[39]

Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, D...

2025
[40]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

2023
[41]

Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread, 2024

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Adam Jermyn, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread, 2024

2024
[42]

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

2025
[43]

Sparse autoencoders for hypothesis generation.arXiv preprint arXiv:2502.04382, February 2025

Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, and Emma Pierson. Sparse autoencoders for hypothesis generation.arXiv preprint arXiv:2502.04382, February 2025

work page arXiv 2025
[44]

Negative results for sparse autoencoders on downstream tasks and deprioritising SAE research

Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Joseph Bloom, Neel Nanda, et al. Negative results for sparse autoencoders on downstream tasks and deprioritising SAE research. DeepMind Safety Research, Alignment Forum, March 2025. 12 Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

2025
[45]

Are sparse autoencoders useful? a case study in sparse probing

Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. InProceedings of the 42nd International Conference on Machine Learning (ICML), pages 29018–29049, 2025

2025
[46]

Manning, and Christopher Potts

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

2025
[47]

Rogov, Ivan Oseledets, and Elena Tutubalina

Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y . Rogov, Ivan Oseledets, and Elena Tutubalina. Sanity checks for sparse autoencoders: Do SAEs beat random baselines?arXiv preprint arXiv:2602.14111, February 2026

work page arXiv 2026
[48]

When the coffee feature activates on coffins: An analysis of feature extraction and steering for mechanistic interpretability.arXiv preprint arXiv:2601.03047, January 2026

Raphael Ronge, Markus Maier, and Frederick Eberhardt. When the coffee feature activates on coffins: An analysis of feature extraction and steering for mechanistic interpretability.arXiv preprint arXiv:2601.03047, January 2026

work page arXiv 2026
[49]

The secret agenda: LLMs strategically lie and our current safety tools are blind.arXiv preprint arXiv:2509.20393, September 2025

Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, and Vanessa Dietze. The secret agenda: LLMs strategically lie and our current safety tools are blind.arXiv preprint arXiv:2509.20393, September 2025

work page arXiv 2025
[50]

The urgency of interpretability

Dario Amodei. The urgency of interpretability. Anthropic, April 2025

2025
[51]

Auditing language models for hidden objectives.arXiv preprint arXiv:2503.10965, March 2025

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, et al. Auditing language models for hidden objectives.arXiv preprint arXiv:2503.10965, March 2025

work page arXiv 2025
[52]

Bowman, Sara Price, Samuel Marks, and Rowan Wang

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, and Rowan Wang. AuditBench: Evaluating alignment auditing techniques on models with hidden behaviors.arXiv preprint arXiv:2602.22755, March 2026

work page arXiv 2026
[53]

Bowman, Trenton Bricken, Alex Cloud, Misha Wagner, Rowan Wang, Evan Hubinger, Fabien Roger, and Samuel Marks

Johannes Treutlein, Samuel R. Bowman, Trenton Bricken, Alex Cloud, Misha Wagner, Rowan Wang, Evan Hubinger, Fabien Roger, and Samuel Marks. Pre-deployment auditing can catch an overt saboteur. Anthropic Alignment Science Blog, January 2026

2026
[54]

Farley and Christian R

Edwin A. Farley and Christian R. Lansang. AI auditing: First steps towards the effective regulation of artificial intelligence systems.Harvard Journal of Law & Technology, Digest, February 2025

2025
[55]

Conformity assessments under the EU AI Act: A practical guide, 2025

Future of Privacy Forum and OneTrust. Conformity assessments under the EU AI Act: A practical guide, 2025

2025
[56]

International AI safety report 2026

Yoshua Bengio et al. International AI safety report 2026. Technical report, Commissioned by the International AI Safety Report Secretariat, February 2026

2026
[57]

Singer, Ruth E

Rishi Bommasani, Scott R. Singer, Ruth E. Appel, Sarah Cen, A. Feder Cooper, Elena Cryst, Lindsey A. Gailmard, Ian Klaus, Meredith M. Lee, Inioluwa Deborah Raji, Anka Reuel, Drew Spence, Alexander Wan, Angelina Wang, Daniel Zhang, Daniel E. Ho, Percy Liang, Dawn Song, Joseph E. Gonzalez, Jonathan Zittrain, Jennifer Tour Chayes, Mariano-Florentino Cuéllar,...

2025
[58]

2025 AI safety index, 2025

Future of Life Institute. 2025 AI safety index, 2025

2025
[59]

Model AI governance framework for generative AI, 2024

Infocomm Media Development Authority. Model AI governance framework for generative AI, 2024

2024
[60]

AI Verify: Testing framework and toolkit, 2024

AI Verify Foundation. AI Verify: Testing framework and toolkit, 2024

2024
[61] Qingyu Yin, Chak Tou Leong, Wenxuan Huang, Wenjie Li, Linyi Yang, Xiting Wang, Jaehong Yoon, YunXing, XingYu, and Jinjin Gu. Refusal falls off a cliff: How safety alignment fails in reasoning? arXiv preprint arXiv:2510.06036, October 2025

[62] Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. Detecting strategic deception using linear probes. arXiv preprint arXiv:2502.03407, 2025

[63] Vikram Singh. AI governance must move from ‘point-in-time’ audits to ‘living’ compliance. LSE Business Review, February 2026

[64] Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, and Geoffrey Irving. A sketch of an AI control safety case. arXiv preprint arXiv:2501.17315, January 2025

[65] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023

[66] Tuomas Oikarinen and Tsui-Wei Weng. Linear explanations for individual neurons. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[67] Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, 2019

[68] Kevin R. Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022

[69] Christoph Schnabl, Daniel Hugenroth, Bill Marino, and Alastair R. Beresford. Attestable audits: Verifiable AI safety benchmarks using trusted execution environments. arXiv preprint arXiv:2506.23706, June 2025

[70] Heng Jin, Chaoyu Zhang, Hexuan Yu, Shanghao Shi, Ning Zhang, Y. Thomas Hou, and Wenjing Lou. Trusting what you cannot see: Auditable fine-tuning and inference for proprietary AI. arXiv preprint arXiv:2603.07466, March 2026

[71] Andrew Trask et al. Beyond privacy: Structured transparency through secure enclaves. Working paper, 2024

[72] Sakda Waiwitlikhit et al. Zero-knowledge proof protocols for ML model verification. Working paper, 2024

[73] European Commission. General-purpose AI code of practice (final). https://digital-strategy.ec.europa.eu/, July 2025

[74] National Institute of Standards and Technology. Artificial intelligence risk management framework (AI RMF 1.0). Technical Report AI 100-1, NIST, January 2023

[75] National Institute of Standards and Technology. Artificial intelligence risk management framework: Generative AI profile. Technical Report AI 600-1, NIST, July 2024

[76] Innovation, Science and Economic Development Canada. Voluntary code of conduct on the responsible development and management of advanced generative AI systems, 2023

[77] Department of Industry, Science and Resources (Australia). Voluntary AI safety standard, 2024

[78] Ministry of Economy, Trade and Industry (Japan). AI guidelines for business, version 1.0, 2024

[79] National Assembly of the Republic of Korea. Act on the development of artificial intelligence and establishment of a foundation for trust (AI Basic Act), December 2024

[80] Ministry of Electronics and Information Technology (India). AI governance guidelines, February 2026


[1] European Parliament and Council. Regulation (EU) 2024/1689 on harmonised rules on artificial intelligence (AI act), June 2024

[2] State of California. Senate bill 53: Transparency in frontier artificial intelligence act (TFAIA), September 2025

[3] Council of Europe. Framework convention on artificial intelligence and human rights, democracy and the rule of law, September 2024

[4] Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell. Black-box access is insufficient for rigorous AI audits, 2024

[5] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324, 2023

[6] Marie Davidsen Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, and Markus Anderljung. Safety cases for frontier AI. arXiv:2410.21572, 2024

[7] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022

[8] Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Sören Mindermann, Evan Hubinger, Ethan Perez, and Kevin K. Troy. Agentic misalignment: How LLMs could be insider threats. arXiv preprint arXiv:2510.05179, October 2025

[9] UK AI Security Institute. Year in review 2025, 2025

[10] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence Chan. ... arXiv, 2025

[11] Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming. arXiv preprint arXiv:2412.04984, 2024

[12] Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, and Adam Gleave. STACK: Adversarial attacks on LLM safeguard pipelines. arXiv preprint arXiv:2506.24068, 2025

[13] State of California. Senate bill 1047: Safe and secure innovation for frontier artificial intelligence models act (vetoed), 2024

[14] Anthropic. Frontier compliance framework (SB-53), December 2025

[15] Francesca Gomez. Adapting insider risk mitigations for agentic misalignment: An empirical study. arXiv preprint arXiv:2510.05192, October 2025

[16] UK AI Security Institute. Safety-case research programme. Technical report, UK AISI, 2025

[17] Neel Nanda et al. A pragmatic vision for interpretability. Alignment Forum, December 2025

[18] Rohin Shah et al. An approach to technical AGI safety and security. Google DeepMind, 2025

[19] Toby Shevlane. Structured access: An emerging paradigm for safe AI deployment. Centre for the Governance of AI working paper, 2022

[20] Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, Neel Guha,... Kochenderfer, and Robert Trager. ... 2025

[21] Benjamin S. Bucknall and Robert F. Trager. Structured access for third-party research on frontier AI models. GovAI working paper, 2023

[22] Jacob Charnock, Alejandro Tlaie, Kyle O’Brien, Stephen Casper, and Aidan Homewood. Expanding external access to frontier AI models for dangerous capability evaluations. arXiv preprint arXiv:2601.11916, January 2026

[23] Mauricio Baker, Gabriel Kulp, Oliver Marks, Miles Brundage, and Lennart Heim. Verifying international agreements on AI: Six layers of verification for rules on large-scale AI development and deployment. arXiv preprint arXiv:2507.15916, July 2025

[24] Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, and Luciano Floridi. Auditing large language models: A three-layered approach. AI and Ethics, 4(4):1085–1115, 2024

[25] Sung Une Lee, Liming Zhu, Md Shamsujjoha, Liming Dong, Qinghua Lu, Jieshan Chen, and Lionel Briand. A structured approach to safety case construction for AI systems. arXiv preprint arXiv:2601.22773, 2026

[26] Cameron F. Kerry et al. Is AI sovereignty possible? lessons from semiconductor and cloud policy. Technical report, Brookings Institution, February 2026

[27] Zoe Jay Hawkins, Vili Lehdonvirta, and Boxi Wu. AI compute sovereignty: Infrastructure control across territories, cloud providers, and accelerators. SSRN Working Paper, June 2025

[28] Lennart Heim. Understanding the artificial intelligence diffusion framework: Can export controls create a U.S.-led global artificial intelligence ecosystem? RAND Perspective PEA3776-1, RAND Corporation, January 2025

[29] McKinsey & Company. The sovereign AI agenda: Tech:forward survey. Technical report, McKinsey, December 2025

[30] Carl D. Liggio. The expectation gap: The accountant’s legal Waterloo. Journal of Contemporary Business, 3(3):27–44, 1974

[31] Brenda Porter. An empirical study of the audit expectation-performance gap. Accounting and Business Research, 24(93):49–68, 1993

[32] Christopher Humphrey, Peter Moizer, and Stuart Turley. The audit expectations gap in Britain: An empirical investigation. Accounting and Business Research, 23(sup1):395–411, 1993

[33] Mahdi Salehi. Audit expectation gap: Concept, nature and trace. African Journal of Business Management, 5(21):8376–8392, 2011

[34] Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D’Cruz, and Xander Davies. UK AISI alignment evaluation case-study. arXiv preprint arXiv:2604.00788, April 2026

[35] METR. Time horizon 1.1. METR Blog, January 2026

[36] Bronson Schoen, Evgenia Nitishinskaya, Carson Denison, Alon Karpas, Kushal Tirumala, Boaz Wang, Tomek Korbak, Samuel Marks, Stephanie Lin, Catherine Olsson, Sara Riedel, Henry Sleight, Fabien Roger, Marius Hobbhahn, Mikita Balesni, et al. Stress testing deliberative alignment for anti-scheming training. arXiv preprint arXiv:2509.15541, September 2025

[37] OpenAI. Detecting and reducing scheming in AI models. OpenAI Blog (collaboration with Apollo Research), September 2025

[38] Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for AI safety – a review. Transactions on Machine Learning Research (TMLR), 2024

[39] Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath. ... 2025

[40] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023

[41] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Adam Jermyn, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024

[42] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo... 2025

[43] Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, and Emma Pierson. Sparse autoencoders for hypothesis generation. arXiv preprint arXiv:2502.04382, February 2025

[44] Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Joseph Bloom, Neel Nanda, et al. Negative results for sparse autoencoders on downstream tasks and deprioritising SAE research. DeepMind Safety Research, Alignment Forum, March 2025

[45] Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pages 29018–29049, 2025

[46] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

[47] Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, and Elena Tutubalina. Sanity checks for sparse autoencoders: Do SAEs beat random baselines? arXiv preprint arXiv:2602.14111, February 2026

[48] Raphael Ronge, Markus Maier, and Frederick Eberhardt. When the coffee feature activates on coffins: An analysis of feature extraction and steering for mechanistic interpretability. arXiv preprint arXiv:2601.03047, January 2026

[49] Caleb DeLeeuw, Gaurav Chawla, Aniket Sharma, and Vanessa Dietze. The secret agenda: LLMs strategically lie and our current safety tools are blind. arXiv preprint arXiv:2509.20393, September 2025

[50] Dario Amodei. The urgency of interpretability. Anthropic, April 2025

[51] Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, et al. Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965, March 2025
