Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
Pith reviewed 2026-06-30 20:50 UTC · model grok-4.3
The pith
Behavioural assurance cannot verify the unobservable safety properties that current AI governance frameworks require.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Behavioural assurance, even when carefully designed, is epistemically limited to observable model outputs. It cannot verify the latent representations or long-horizon agentic behaviours that governance frameworks presume to regulate. The paper formalises this structural mismatch as the audit gap and argues that current methodologies produce fragile assurance, in which the evidential structure fails to support the asserted safety claims.
What carries the argument
The audit gap: the divergence between the verification access required by governance frameworks and the verification access achievable by behavioural methods.
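One hedged way to write this definition down is as a set difference. The notation below (P, R, A_m, G_m) is an assumption introduced here for illustration; the paper's own formalisation is not reproduced on this page and may differ.

```latex
% Hypothetical notation, not taken from the paper.
% P   : set of safety properties named in governance instruments
% R   : subset of P for which the instruments require reviewable evidence
% A_m : subset of P that can be verified under access mode m
\[
  G_m \;=\; R \setminus A_m ,
  \qquad
  G_{\mathrm{beh}} \;=\; R \setminus A_{\mathrm{beh}}
  \quad \text{(audit gap under behavioural access)} .
\]
% Fragile assurance, on this reading: a safety claim about a property p
% is asserted on mode-m evidence although p \in G_m.
```

On this reading the gap is non-empty only if R contains properties outside A_beh, which is exactly the premise the referee report asks the authors to document.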
If this is right
- Governance frameworks over-reach the epistemic limits of behavioural assurance and therefore rest on fragile evidence.
- Incentive structures in industry and geopolitics systematically favour observable proxies over structural verification.
- Legal text should explicitly bound the weight given to behavioural evidence.
- Pre-deployment access should be extended to include mechanistic-evidence classes such as linear probes, activation patching, and before/after-training comparisons.
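To make the first of the mechanistic-evidence classes named above concrete, here is a minimal sketch of a linear probe. It is an illustration under stated assumptions, not the paper's method: the "activations" are synthetic vectors with a binary latent property planted along one direction, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical.

```python
import math
import random


def make_synthetic_activations(n=200, dim=8, seed=0):
    """Fake 'activations': Gaussian noise plus a planted direction whose sign encodes the label."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 2 * y - 1
        acts.append([rng.gauss(0, 1) + 3.0 * sign * d for d in direction])
        labels.append(y)
    return acts, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(acts, labels, lr=0.1, steps=200):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(acts, labels, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y))
        for x, y in zip(acts, labels)
    )
    return correct / len(labels)


acts, labels = make_synthetic_activations()
w, b = train_linear_probe(acts[:150], labels[:150])
print("held-out probe accuracy:", probe_accuracy(acts[150:], labels[150:], w, b))
```

A high held-out accuracy here shows only that the planted property is linearly decodable from these synthetic vectors. Whether a probe on a real model's activations supports a safety claim is the kind of question the paper argues needs white-box access to answer.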
Where Pith is reading between the lines
- If the audit gap holds, regulators may need to narrow the safety properties they attempt to verify to those that remain observable.
- Adoption of mechanistic methods could shift assurance practice from post-training testing toward inspection of internal representations during development.
- The distinction between behavioural and mechanistic evidence may become a new axis in international AI standards discussions.
Load-bearing premise
Governance frameworks actually require verifiable evidence of unobservable internal properties rather than merely observable behavioural outputs.
What would settle it
A demonstration that behavioural red-teaming or evaluations alone can reliably determine whether a model has hidden objectives or resists long-horizon loss of control, checked against models where mechanistic methods later reveal such features.
Original abstract
This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper claims that behavioural assurance methods (evaluations, red-teaming) are epistemically limited to observable outputs and cannot verify the latent properties (absence of hidden objectives, resistance to long-horizon loss-of-control) that AI governance frameworks enacted 2019–early 2026 are said to require. It formalises the resulting 'audit gap' and 'fragile assurance', reports a high-level analysis of a 21-instrument inventory that identifies an incentive gradient favouring surface proxies, and proposes bounding the legal weight of behavioural evidence while extending access for mechanistic techniques such as linear probes and activation patching.
Significance. If the central premise holds, the paper identifies a structural mismatch between regulatory demands and current verification capabilities that could systematically produce over-claimed safety assurances; the proposed pivot toward mechanistic evidence classes would be a concrete policy-technical response. The manuscript supplies no quantitative data, derivations, or machine-checked results, so its contribution is conceptual and rests entirely on the accuracy of the 21-instrument reading.
major comments (2)
- [21-instrument inventory analysis] The 21-instrument inventory analysis (described in the abstract and the section presenting the inventory): the claim that the frameworks 'require reviewable evidence of properties such as the absence of hidden objectives' is load-bearing for the audit-gap and fragile-assurance definitions, yet the manuscript supplies no quoted statutory or guidance language, no categorization table, and no explicit mapping showing which instruments use terms that exceed observable behavioural proxies. Without this, the mismatch remains an interpretive assertion rather than a demonstrated divergence.
- [formalisation of audit gap] Definition of the audit gap (early formalisation section): the gap is defined as the divergence between 'required' and 'achievable' verification access; if the 21 instruments in fact demand only observable risk assessments, red-teaming outputs, or capability benchmarks (as the skeptic note flags), the gap is empty by construction and the subsequent incentive-gradient claim loses its target.
minor comments (1)
- [proposal section] The abstract and proposal section refer to 'linear probes, activation patching, and before/after-training comparisons' without indicating whether these are intended as mandatory or merely voluntary supplements; a brief clarification of scope would improve readability.
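For readers unfamiliar with the second technique in that list, the following is a minimal sketch of activation patching on a toy model. Everything in it is an assumption made for illustration: the two-layer network, its fixed weights, and the names `forward` and `patching_effects` are hypothetical and do not come from the paper.

```python
def forward(x, patch=None):
    """Toy two-layer ReLU model. `patch` maps a hidden-unit index to a value that overwrites it."""
    w1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 hidden units, 2 inputs
    w2 = [2.0, 0.0, 0.1]                       # output reads mainly from hidden unit 0
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden


def patching_effects(clean_x, corrupted_x):
    """For each hidden unit, copy its clean-run value into the corrupted run
    and report the fraction of the clean-vs-corrupted output gap recovered."""
    clean_out, clean_hidden = forward(clean_x)
    corrupted_out, _ = forward(corrupted_x)
    gap = clean_out - corrupted_out
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")
    effects = {}
    for idx, value in enumerate(clean_hidden):
        patched_out, _ = forward(corrupted_x, patch={idx: value})
        effects[idx] = (patched_out - corrupted_out) / gap
    return effects


effects = patching_effects(clean_x=[1.0, 0.0], corrupted_x=[0.0, 1.0])
for idx, effect in effects.items():
    print(f"hidden unit {idx}: recovered fraction = {effect:.2f}")
```

In this toy setting, patching hidden unit 0 recovers the whole output gap and the other two units recover none of it, which localises the behaviour to one internal component. That is evidence about internals rather than outputs, and it requires access to intermediate activations that black-box evaluation does not provide.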
Simulated Author's Rebuttal
We thank the referee for the careful and substantive review. We agree that the 21-instrument analysis is presented at too high a level and that the load-bearing interpretive step requires explicit support. We will revise the manuscript to include the requested mappings, quotes, and categorization. Responses to the major comments follow.
Point-by-point responses
-
Referee: [21-instrument inventory analysis] The 21-instrument inventory analysis (described in the abstract and the section presenting the inventory): the claim that the frameworks 'require reviewable evidence of properties such as the absence of hidden objectives' is load-bearing for the audit-gap and fragile-assurance definitions, yet the manuscript supplies no quoted statutory or guidance language, no categorization table, and no explicit mapping showing which instruments use terms that exceed observable behavioural proxies. Without this, the mismatch remains an interpretive assertion rather than a demonstrated divergence.
Authors: We accept the point. The manuscript currently summarizes the inventory findings without supplying the underlying statutory language or a mapping table. In revision we will add an appendix containing a table that lists each of the 21 instruments, quotes the relevant passages concerning required evidence, and classifies each passage according to whether it refers only to observable behavioural outputs or also to latent properties (e.g., absence of hidden objectives or long-horizon loss-of-control resistance). This will convert the interpretive claim into an explicit, checkable mapping.
Revision: yes
-
Referee: [formalisation of audit gap] Definition of the audit gap (early formalisation section): the gap is defined as the divergence between 'required' and 'achievable' verification access; if the 21 instruments in fact demand only observable risk assessments, red-teaming outputs, or capability benchmarks (as the skeptic note flags), the gap is empty by construction and the subsequent incentive-gradient claim loses its target.
Authors: The audit-gap definition is deliberately conditional on the claim that the instruments require evidence of latent properties. We will revise the formalisation section to cross-reference the new mapping table, thereby making the divergence between required and achievable verification explicit rather than asserted. Should the added quotations show that all instruments are limited to observable outputs, we would retract the gap claim; our current reading, however, identifies language in multiple instruments that exceeds observable proxies.
Revision: partial
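The mapping table promised in the rebuttal, and the referee's point that the gap could be empty by construction, can be sketched as a small data structure. This is a hypothetical schema rather than the authors' appendix: `EvidenceClass`, `Passage`, `audit_gap`, and `BEHAVIOURAL_ACCESS` are names invented here, and the rows are placeholders, not real instruments or real quotations.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceClass(Enum):
    OBSERVABLE = "observable behavioural outputs"
    LATENT = "latent properties"


@dataclass(frozen=True)
class Passage:
    instrument: str          # placeholder label, not a real instrument
    quote: str               # to be filled with quoted statutory or guidance language
    required: EvidenceClass  # evidence class the passage is read as requiring


# Evidence classes that behavioural methods can reach, per the paper's premise.
BEHAVIOURAL_ACCESS = {EvidenceClass.OBSERVABLE}


def audit_gap(passages, achievable):
    """Return the passages whose required evidence class lies outside `achievable`."""
    return [p for p in passages if p.required not in achievable]


rows = [
    Passage("Instrument A (placeholder)", "<quoted passage>", EvidenceClass.OBSERVABLE),
    Passage("Instrument B (placeholder)", "<quoted passage>", EvidenceClass.LATENT),
]

gap = audit_gap(rows, BEHAVIOURAL_ACCESS)
print("instruments in the gap:", [p.instrument for p in gap])
```

If every row in the real table were classified as `OBSERVABLE`, `audit_gap` would return an empty list, which is the outcome under which the authors say they would retract the gap claim.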
Circularity Check
No significant circularity; definitional argument with no equations, fits, or load-bearing self-citations
Full rationale
The paper is a position piece that defines the 'audit gap' and 'fragile assurance' from an asserted mismatch between governance requirements (unobservables such as hidden objectives) and behavioral methods' observable limits, then analyzes a 21-instrument inventory to identify an incentive gradient. No mathematical derivations, fitted parameters, or equations exist. No self-citations are invoked as load-bearing premises, and the central claim does not reduce by construction to its own inputs or prior author work. The argument is interpretive and definitional rather than circular; any weakness lies in evidentiary support for the regulatory-requirement premise, not in self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Governance frameworks require verifiable evidence of latent properties (hidden objectives, long-horizon behaviours) rather than only observable outputs.
- Domain assumption: Behavioural evaluations and red-teaming are epistemically limited to observable model outputs.
invented entities (2)
- audit gap: no independent evidence
- fragile assurance: no independent evidence