Evaluating AI Providers' Frontier Safety Frameworks
Pith reviewed 2026-05-17 03:43 UTC · model grok-4.3
The pith
Twelve AI companies' safety frameworks score a median of 18% on risk management criteria drawn from aviation and nuclear industries, making them poor tools for external accountability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current frontier AI safety frameworks contain vague commitments that limit their effectiveness as external accountability instruments, as shown by low scores across sixty-five criteria adapted from high-risk industries. A company incorporating all leading practices already present among its peers could nonetheless reach triple the current median score.
What carries the argument
Sixty-five weighted criteria grouped into risk identification, risk analysis and evaluation, risk treatment, and risk governance, adapted from Campos et al. (2025) and high-risk industry standards.
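To make the aggregation concrete, here is a minimal sketch of how weighted criteria grouped into dimensions could roll up into a percentage score. The criterion names, weights, and grades are hypothetical placeholders, and the treatment of missing grades as unmet is an assumption; the paper's actual weights and grading scale are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str        # hypothetical identifier, not taken from the paper
    dimension: str   # one of the four dimensions named above
    weight: float    # illustrative weight, not the paper's value


def weighted_score(criteria, grades):
    """Return the weighted score as a percentage in [0, 100].

    grades maps a criterion name to a grade in [0, 1]; a criterion with no
    grade is treated as unmet (0.0). This is an assumption of the sketch.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria must carry positive total weight")
    earned = sum(c.weight * grades.get(c.name, 0.0) for c in criteria)
    return 100.0 * earned / total_weight


def score_by_dimension(criteria, grades):
    """Return {dimension: weighted percentage} using the same rule."""
    dimensions = sorted({c.dimension for c in criteria})
    return {
        d: weighted_score([c for c in criteria if c.dimension == d], grades)
        for d in dimensions
    }


# Toy data: four made-up criteria, one per dimension.
example_criteria = [
    Criterion("names_risk_domains", "risk identification", 2.0),
    Criterion("states_risk_tolerance", "risk analysis and evaluation", 1.0),
    Criterion("links_thresholds_to_mitigations", "risk treatment", 1.0),
    Criterion("assigns_risk_owner", "risk governance", 1.0),
]
example_grades = {
    "names_risk_domains": 0.5,
    "states_risk_tolerance": 1.0,
    "links_thresholds_to_mitigations": 0.0,
    # "assigns_risk_owner" is absent, so it counts as unmet
}

print(weighted_score(example_criteria, example_grades))
print(score_by_dimension(example_criteria, example_grades))
```

Reporting the per-dimension breakdown alongside the aggregate shows which of the four dimensions drives a low overall score.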
If this is right
- Vague language in the frameworks prevents outsiders from forecasting how companies will respond to specific advanced AI risks.
- Assessing whether a company's planned mitigation is adequate becomes unreliable without clearer commitments.
- Verifying whether companies have kept their own stated commitments is difficult under current wording.
- Adopting the strongest elements already used by any peer could raise a typical framework's score from the median of eighteen percent to fifty-four percent.
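The jump from 18% to 54% in the last bullet can be read as a "best of peers" composite: for each criterion, take the highest grade any company achieved, then score that composite. The sketch below illustrates that reading on toy data. The company labels, criterion names, and grades are invented, equal weights are used for simplicity, and the per-criterion maximum is an assumption about how the composite is built, not a confirmed description of the paper's method.

```python
from statistics import median


def equal_weight_pct(grades, criterion_names):
    """Percentage score with every criterion weighted equally (a simplification)."""
    if not criterion_names:
        raise ValueError("need at least one criterion")
    total = sum(grades.get(name, 0.0) for name in criterion_names)
    return 100.0 * total / len(criterion_names)


def best_of_peers(grades_by_company):
    """For each criterion, keep the highest grade observed across companies."""
    names = set()
    for grades in grades_by_company.values():
        names.update(grades)
    return {
        name: max(g.get(name, 0.0) for g in grades_by_company.values())
        for name in sorted(names)
    }


# Toy data: three hypothetical companies, each strong on a different criterion.
toy = {
    "A": {"c1": 1.0, "c2": 0.0, "c3": 0.0, "c4": 0.0},
    "B": {"c1": 0.0, "c2": 1.0, "c3": 0.0, "c4": 0.0},
    "C": {"c1": 0.0, "c2": 0.0, "c3": 1.0, "c4": 0.0},
}
criterion_names = ["c1", "c2", "c3", "c4"]

company_scores = [equal_weight_pct(g, criterion_names) for g in toy.values()]
composite_score = equal_weight_pct(best_of_peers(toy), criterion_names)

print(median(company_scores), composite_score)

# The figures reported in the source: 54% composite against an 18% median.
reported_ratio = 54 / 18
print(reported_ratio)
```

In the toy data no single company exceeds 25%, yet the composite reaches 75%, which mirrors how scattered leading practices can add up to a score no individual framework achieves.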
Where Pith is reading between the lines
- Legislators treating these frameworks as compliance documents may need to require more standardized and specific language.
- Companies could improve accountability scores quickly by copying the best current practices from one another rather than inventing new ones.
- Over time, repeated evaluations using the same criteria might pressure the industry toward clearer risk management norms.
Load-bearing premise
The chosen criteria correctly identify the essential elements needed for responsible management of catastrophic AI risks.
What would settle it
A follow-up study that tracks whether companies with higher-scoring frameworks actually take materially different or more effective actions when new risks emerge.
Original abstract
Following the AI Seoul Summit in 2024, twelve AI companies published frontier AI safety frameworks (Frameworks) outlining their approaches to managing catastrophic risks from advanced AI systems. Emerging legislation increasingly treats these Frameworks as external accountability mechanisms, incorporating them into reporting requirements. But what do the Frameworks actually commit each company to do? This study assesses 12 Frameworks, using 65 weighted criteria, across four dimensions: risk identification, risk analysis & evaluation, risk treatment, and risk governance. Our criteria adapt established risk management principles from other high-risk industries (e.g. aviation, nuclear power) to the frontier AI context, following Campos et al. (2025). Overall scores range from 34% (Anthropic) to 8% (Cohere), with a median of 18%. Many aspects are missing or under-specified. These low scores may be natural given the nascency of AI risk management compared to industries with decades of practice. Nonetheless, current Frameworks are limited as accountability functions, with vague commitments that make it difficult to predict company decisions, assess whether planned responses are adequate, or determine whether commitments have been kept. Still, higher scores appear feasible within current constraints: a company adopting all leading practices currently adopted across their peers would score 54%, which is triple the current median.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the frontier AI safety frameworks published by twelve AI companies following the 2024 AI Seoul Summit. Using 65 weighted criteria adapted from Campos et al. (2025) and risk-management principles from high-risk industries (aviation, nuclear), it scores the frameworks across four dimensions: risk identification, risk analysis & evaluation, risk treatment, and risk governance. Aggregate scores range from 34% (Anthropic) to 8% (Cohere) with a median of 18%. The authors conclude that current frameworks are limited as accountability functions because of vague commitments that impede prediction of company decisions, assessment of response adequacy, and verification of compliance, while noting that adopting all leading practices already present across peers could reach 54%.
Significance. If the 65 criteria validly identify the elements required for predictability, adequacy assessment, and verifiability in frontier AI, the paper supplies a timely, structured benchmark that could inform emerging legislation treating these frameworks as accountability mechanisms. The work is strengthened by its explicit identification of a feasible improvement trajectory (54%) and by the systematic transfer of established risk-management principles to a new domain. The result would be more significant if the criteria were shown to be necessary rather than merely aspirational for a domain whose risks remain less amenable to quantitative historical validation than aviation or nuclear power.
major comments (2)
- [Methods] Methods section: the evaluation process is under-specified with respect to inter-rater reliability, the exact weighting procedure applied to the 65 criteria, and the handling of borderline cases. Without these details the link between raw framework text and the reported aggregate percentages (e.g., 34% for Anthropic) cannot be fully assessed.
- [Criteria development and Discussion] Criteria development and Discussion sections: the manuscript adapts criteria from Campos et al. (2025) plus aviation/nuclear principles and notes that low scores may be expected given the nascency of AI risk management, yet still infers that the frameworks are limited as accountability functions. It does not demonstrate that the absent elements (specific quantitative risk metrics, treatment thresholds) are required for accountability rather than premature for frontier AI, where risks lack the historical data available in the source domains.
minor comments (2)
- [Abstract and Results] The abstract and results tables would benefit from a brief statement of how many frameworks were scored by how many raters and whether any inter-rater agreement statistic was computed.
- [References] Ensure the full bibliographic details and status (preprint vs. published) of Campos et al. (2025) appear in the reference list.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the methodological transparency and the scope of our claims about accountability. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Methods] Methods section: the evaluation process is under-specified with respect to inter-rater reliability, the exact weighting procedure applied to the 65 criteria, and the handling of borderline cases. Without these details the link between raw framework text and the reported aggregate percentages (e.g., 34% for Anthropic) cannot be fully assessed.
  Authors: We agree that the Methods section requires greater specificity to allow full assessment of the scoring process. In the revised manuscript we will add: (1) details on inter-rater reliability, noting that two authors independently evaluated each framework against the criteria, with disagreements resolved by consensus and an inter-rater agreement rate reported; (2) the exact weighting procedure, explaining that weights were assigned proportionally to the emphasis in Campos et al. (2025) and standard risk-management frameworks, with equal weighting applied within each of the four dimensions; and (3) handling of borderline cases, including explicit examples of how ambiguous or conditional language in the frameworks was coded (typically conservatively as not fully meeting the criterion). These additions will make the connection between source text and aggregate scores transparent.
  Revision: yes
- Referee: [Criteria development and Discussion] Criteria development and Discussion sections: the manuscript adapts criteria from Campos et al. (2025) plus aviation/nuclear principles and notes that low scores may be expected given the nascency of AI risk management, yet still infers that the frameworks are limited as accountability functions. It does not demonstrate that the absent elements (specific quantitative risk metrics, treatment thresholds) are required for accountability rather than premature for frontier AI, where risks lack the historical data available in the source domains.
  Authors: We appreciate the referee's distinction between aspirational and necessary elements. Our core claim is that accountability functions (predicting decisions, assessing response adequacy, and verifying compliance) logically require commitments that are specific enough to be evaluated, regardless of domain-specific data availability. We ground this in the risk-management literature rather than claiming empirical necessity from historical AI incidents. In the revised Discussion we will (a) more explicitly separate the logical requirements for accountability from domain-specific implementation details, (b) acknowledge that quantitative metrics may be premature and that clear qualitative thresholds could suffice initially, and (c) note that the observed vagueness prevents even basic accountability functions today. We will not overstate the criteria as universally required but will strengthen the argument that current frameworks fall short of the minimum specificity needed for the accountability role assigned by emerging legislation.
  Revision: partial
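The rebuttal promises an inter-rater agreement rate for two independent raters without naming the statistic. As an illustration only, the sketch below computes raw percent agreement and Cohen's kappa, a common chance-corrected choice for two raters and categorical codes. The category labels ("met", "partial", "unmet") and the ratings are hypothetical, and nothing in the source confirms that the authors use kappa.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters assigned the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same code at random,
    # given each rater's own marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical code throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for six criteria from two raters.
rater_a = ["met", "met", "partial", "unmet", "unmet", "unmet"]
rater_b = ["met", "partial", "partial", "unmet", "unmet", "met"]

print(percent_agreement(rater_a, rater_b))
print(cohens_kappa(rater_a, rater_b))
```

Raw agreement alone can look high when most criteria are coded "unmet" by both raters, which is plausible when median scores are low; a chance-corrected statistic guards against that.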
Circularity Check
No significant circularity: original scoring of 12 frameworks is independent of criteria source
The paper performs an empirical evaluation by applying 65 weighted criteria to 12 newly published frontier AI safety frameworks. The criteria are adapted from high-risk industry principles following Campos et al. (2025), with author overlap noted, but this is a standard methodological citation rather than a load-bearing reduction. No equations, fitted parameters presented as predictions, self-definitional loops, or uniqueness theorems are present. The central result—the specific scores, median of 18%, and comparison to a 54% peer-leading benchmark—is generated from direct analysis of the target documents and does not reduce to the prior work by construction. The paper explicitly acknowledges nascency effects, confirming the assessment is not tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- criteria weights
axioms (1)
- domain assumption: Risk-management principles developed for aviation and nuclear power are applicable to frontier AI catastrophic-risk governance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We develop a 65-criteria assessment methodology grounded in established risk management principles from safety-critical industries... across four dimensions: risk identification, risk analysis and evaluation, risk treatment, and risk governance."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "No company explicitly and quantitatively states the maximum risk level they will impose on society."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.