Open Weight AI Models Require Proportional Evaluation Approaches

Christopher Rodriguez; Patricia Paskov; Stephen Casper; Sunishchal Dev

arxiv: 2606.19890 · v1 · pith:WIQRO5VVnew · submitted 2026-06-18 · 💻 cs.CY

Pith reviewed 2026-06-26 15:38 UTC · model grok-4.3

keywords: open-weight models, AI evaluation, proportional evaluation, model risks, AI safety, open source AI, AI governance

The pith

Open-weight AI models need four distinct evaluation approaches, including ones that remove or bypass their safeguards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that open-weight models carry distinct risks because anyone can modify or deploy them without the safety layers that closed models rely on. Existing evaluation practices, built largely for closed-weight deployment, do not test for these risks. The authors propose four proportional evaluation approaches: evaluating models without system-level safeguards, assessing robustness to modifications that undo model-level safeguards, testing selective capability amplification, and proxying worst-case misuse. A review of 37 model families released from 2025 through April 2026 finds that only one applies all four and most apply none. This matters because open-weight releases are spreading quickly and approaching closed-model performance.

Core claim

Open-weight models introduce distinct risk factors that standard evaluations fail to address, so developers must adopt proportional evaluation approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). A systematic review shows only one of 37 model families fulfills all four, and most fulfill none.
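The compliance count behind this claim can be pictured as a simple checklist tally. The following is a minimal sketch, not the authors' procedure: the `FamilyReview` record, the `tally` function, and the placeholder family names are all assumptions introduced for illustration, and the example data are invented rather than taken from the paper's review.

```python
from dataclasses import dataclass

# The four criteria named in the paper.
PE_CRITERIA = ("PE1", "PE2", "PE3", "PE4")


@dataclass(frozen=True)
class FamilyReview:
    """Hypothetical record: one model family and the PE criteria it was judged to fulfill."""
    name: str
    fulfilled: frozenset


def tally(reviews):
    """Count families that fulfill all, some, or none of PE1-PE4."""
    counts = {"all": 0, "some": 0, "none": 0}
    for review in reviews:
        unknown = review.fulfilled - set(PE_CRITERIA)
        if unknown:
            raise ValueError(f"unknown criteria for {review.name}: {sorted(unknown)}")
        if review.fulfilled == set(PE_CRITERIA):
            counts["all"] += 1
        elif review.fulfilled:
            counts["some"] += 1
        else:
            counts["none"] += 1
    return counts


# Placeholder data for illustration only; these are NOT the paper's findings.
example = [
    FamilyReview("family-a", frozenset(PE_CRITERIA)),
    FamilyReview("family-b", frozenset({"PE1"})),
    FamilyReview("family-c", frozenset()),
    FamilyReview("family-d", frozenset()),
]

if __name__ == "__main__":
    print(tally(example))
```

A tally of this kind is only as reliable as the per-family judgments that feed it, which is why the referee report asks how fulfillment of each criterion was checked.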

What carries the argument

The four proportional evaluation (PE) approaches that target risks unique to public model weights rather than closed deployment.

If this is right

Most current evaluations of open-weight models are insufficient for their risk profile.
Developers releasing open weights should run tests that remove system safeguards and check for bypasses.
Funders and policymakers should require evidence of these four approaches before supporting open releases.
As open models approach closed-model performance, evaluation gaps will widen without these changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Policymakers might treat open-weight and closed-weight models under separate regulatory tracks.
Research could focus on cheap proxies for worst-case misuse that do not require full capability disclosure.
Open releases could slow if the added evaluation burden proves high.

Load-bearing premise

The four listed evaluation approaches are the necessary and proportional responses to the risks that come with releasing model weights publicly.

What would settle it

An empirical study showing that standard closed-model evaluations already identify the same risks for open-weight models, or that the four PE methods add no new information about misuse potential.

Original abstract

Open-weight AI models (OWMs), or models released with publicly-available weights, are distributing rapidly and approaching the performance levels of leading closed-weight AI models (CWMs). While OWMs offer substantial scientific and economic benefits, their release introduces distinct risk factors for which existing evaluation practices, largely designed for CWM deployment, fail to account. In this paper, we argue that these risk factors demand distinct proportional evaluation (PE) approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). We systematically review current evaluation practices of OWMs released in 2025 through April 2026, finding that only one of the 37 families of models reviewed fulfills PE1-4 and most do not fulfill any. This paper targets policymakers, funders, and researchers involved in AI evaluation. As OWMs grow increasingly capable, their evaluation warrants close attention from developers, funders, and governance bodies alike.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags that open weights create distinct risks but asserts four specific evaluations are required without deriving or testing that mapping.

The main thing to know is that this paper argues open-weight models (OWMs) have risk factors that closed-weight evaluations miss, so they need four new proportional evaluation approaches: testing without system safeguards (PE1), checking robustness to undoing model safeguards (PE2), assessing selective capability amplification (PE3), and proxying worst-case misuse (PE4). It then counts 37 model families from 2025 through April 2026 and finds only one meets all four while most meet none.

What is new is the explicit naming of those four criteria plus the systematic count across recent releases. The underlying observation that public weights change the misuse picture has been discussed before, but the concrete tally and the PE framing are fresh.

The paper does a service by directing attention to how release format affects evaluation needs, which matters for funders and governance bodies. That distinction is real and worth spelling out.

The soft spots are not minor. The abstract supplies no inclusion criteria, data sources, or checking procedure for the 37 families, so the compliance numbers cannot be assessed. More importantly, the claim that the risks "demand" exactly these four approaches is presented as a direct implication rather than derived from evidence. There is no step showing why each PE is necessary and proportional, why alternatives like licensing or monitoring would not suffice, or that meeting the criteria would actually lower the identified risks. The count therefore measures adherence to an unvalidated prescription.

This is for policymakers, funders, and evaluation researchers who care about open versus closed distinctions. It shows honest engagement with the policy literature even if the evidential base is light. I would bring it to a reading group to discuss evaluation standards.

It deserves peer review because the topic is timely and the count could become useful once methods are documented. I would recommend sending it out rather than desk rejecting.

Referee Report

2 major / 0 minor

Summary. The paper claims that open-weight AI models (OWMs) introduce distinct risk factors not addressed by evaluation practices designed for closed-weight models (CWMs). It argues that these risks demand four specific proportional evaluation (PE) approaches: PE1 (evaluating without system-level safeguards), PE2 (assessing robustness to modifications that undo model-level safeguards), PE3 (testing selective capability amplification), and PE4 (proxying worst-case misuse). A systematic review of 37 model families released in 2025 through April 2026 finds that only one fulfills all PE1-4 and most fulfill none. The paper is directed at policymakers, funders, and researchers in AI evaluation.

Significance. If the argument holds, the paper would highlight an important gap in AI evaluation practices as OWMs approach CWM performance levels, providing a concrete set of PE approaches that could shape governance discussions and research priorities. The timely focus on open-weight release risks is a strength, though the lack of a derived risk-to-mitigation mapping limits its immediate applicability.

major comments (2)

[Abstract] The abstract states a review result covering 37 families but supplies no methodology, inclusion criteria, or data sources. This absence is load-bearing for the central empirical claim that most models fulfill none of the criteria, as it prevents assessment of whether the count is reproducible or representative.

[Introduction/argument section] The claim that the identified risk factors 'demand' exactly these four PE approaches (PE1-PE4) is presented as a direct implication without an intervening derivation. There is no enumeration of risk factors with supporting evidence of distinctness, demonstration that each PE is necessary and proportional (versus alternatives such as licensing or usage monitoring), or test showing that satisfying PE1-PE4 would materially reduce the risks. This normative assertion underpins the entire recommendation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and transparency.

Referee: [Abstract] The abstract states a review result covering 37 families but supplies no methodology, inclusion criteria, or data sources. This absence is load-bearing for the central empirical claim that most models fulfill none of the criteria, as it prevents assessment of whether the count is reproducible or representative.

Authors: We agree that the abstract, as a standalone summary, should include a brief statement of the review methodology to support reproducibility of the central empirical finding. The full details appear in Section 3, but we will revise the abstract to add one sentence specifying the time window (models released 2025 through April 2026), primary sources (Hugging Face model hub releases and developer announcements), and inclusion criteria (publicly released weight checkpoints from distinct model families).

Revision: yes

Referee: [Introduction/argument section] The claim that the identified risk factors 'demand' exactly these four PE approaches (PE1-PE4) is presented as a direct implication without an intervening derivation. There is no enumeration of risk factors with supporting evidence of distinctness, demonstration that each PE is necessary and proportional (versus alternatives such as licensing or usage monitoring), or test showing that satisfying PE1-PE4 would materially reduce the risks. This normative assertion underpins the entire recommendation.

Authors: Section 2 enumerates the four distinct risk factors unique to open weights and links each to the corresponding PE criterion, arguing proportionality on the basis of weight accessibility. We acknowledge, however, that an explicit derivation step (e.g., a mapping table contrasting the PEs with alternatives such as licensing) would make the normative claim more transparent. We will add such a subsection in the revised introduction while preserving the original argument structure.

Revision: yes
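The inclusion criteria named in the first response (a release window, publicly released weights, distinct model families) could be written down as an explicit filter, which is one way to make the count of families reproducible. The sketch below is illustrative only: the `Release` record, its field names, the exact window boundaries (1 January 2025 to 30 April 2026), and the sample entries are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from datetime import date

# Assumed reading of "2025 through April 2026"; the exact boundaries are a guess.
WINDOW_START = date(2025, 1, 1)
WINDOW_END = date(2026, 4, 30)


@dataclass(frozen=True)
class Release:
    """Hypothetical record for one released checkpoint."""
    family: str
    checkpoint: str
    released: date
    weights_public: bool


def included_families(releases):
    """Return the sorted distinct families with a public-weight release inside the window."""
    return sorted({
        r.family
        for r in releases
        if r.weights_public and WINDOW_START <= r.released <= WINDOW_END
    })


# Made-up entries for illustration; none of these correspond to real models.
sample = [
    Release("family-a", "a-7b", date(2025, 3, 1), True),
    Release("family-a", "a-70b", date(2025, 6, 1), True),    # same family, counted once
    Release("family-b", "b-8b", date(2024, 11, 15), True),   # released before the window
    Release("family-c", "c-api", date(2025, 9, 9), False),   # weights not public
    Release("family-d", "d-12b", date(2026, 4, 30), True),   # last day of the window
]

if __name__ == "__main__":
    print(included_families(sample))
```

Stating the filter this explicitly would let a reader rerun the selection against the same sources and check whether the number of families matches.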

Circularity Check

0 steps flagged

No circularity; normative argument with independent compliance count

The paper advances a policy argument that distinct OWM risk factors 'demand' four specific PE approaches (PE1–PE4), then counts compliance among 37 model families. No equations, fitted parameters, or self-referential reductions exist. The central claim is an explicit normative assertion rather than a derivation that collapses to its own inputs by construction. The review simply applies the authors' framework to external model releases; it does not rename a fit as a prediction or rely on load-bearing self-citations. This is a standard argumentative paper whose claims stand or fall on the strength of the risk-factor enumeration and proportionality reasoning, not on internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that open weights create qualitatively different risks and on the newly introduced PE1-4 concepts; no free parameters or external benchmarks are used.

axioms (1)

Domain assumption: Open-weight models introduce distinct risk factors for which existing evaluation practices fail to account.
Stated directly in the abstract as the premise for demanding PE approaches.

invented entities (1)

Proportional Evaluation approaches (PE1-PE4): no independent evidence
purpose: To provide evaluation methods matched to open-weight release risks
Newly defined in the paper with no prior citation or independent validation mentioned.

