
arxiv: 2606.19899 · v1 · pith:O26AJ7KSnew · submitted 2026-06-18 · 💻 cs.CY · cs.AI

Measuring Biological Capabilities and Risks of AI Agents

Pith reviewed 2026-06-26 15:36 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords AI agents · biological risks · evaluations · biosecurity · AI policy · agentic systems · risk assessment · scientific tasks

The pith

Choices around defining, designing, running, scoring, and documenting biological agentic evaluations materially shape what their results imply about AI risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the challenge of producing credible evidence on the biological capabilities of AI agents that can perform multi-step scientific tasks. It introduces biological agentic evaluations as a tool for risk assessment while stressing that design decisions in those evaluations determine the strength of any risk conclusions drawn. A reader would care because such systems are entering real research workflows, making it essential for decision-makers to understand what evaluation outputs can and cannot establish. The authors synthesize existing evidence and offer experience-based considerations to support cautious interpretation by policymakers, funders, and biosecurity practitioners.

Core claim

Biological agentic evaluations assess AI systems capable of autonomously or collaboratively performing multi-step scientific tasks, but choices in how these evaluations are defined, designed, run, scored, and documented materially shape what results do and do not imply about biological risk. Drawing from the authors' own evaluations, the paper supplies practical considerations intended to help interpret outputs with appropriate caution and to guide investments and assessments.

What carries the argument

Biological agentic evaluations together with the set of practical, experience-grounded considerations on how design choices affect risk implications.

If this is right

  • Policymakers should interpret biological evaluation outputs with appropriate caution.
  • Public and private funders should direct resources toward high-leverage investments in AI-biology evaluation research.
  • Biosecurity practitioners gain support when assessing emerging AI systems.
  • Researchers designing or conducting agentic evaluations receive guidance on documenting choices to clarify what results imply.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardized reporting templates could emerge as a practical response to the emphasis on documentation.
  • Meta-analyses across different labs may need to adjust for variation in evaluation design choices to remain reliable.
  • The same interpretive caution could apply to agentic evaluations in other domains such as chemical or cyber risks.
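As an illustration of the first extension, a reporting template could be as simple as a record with one field per stage the paper names. The sketch below is hypothetical: the class `EvaluationRecord`, its field names, and the helper `undocumented` are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass, fields


@dataclass
class EvaluationRecord:
    """Hypothetical reporting template: one free-text field per evaluation stage."""
    definition: str = ""      # what capability the evaluation is meant to measure
    design: str = ""          # task structure, tools, and affordances given to the agent
    run_conditions: str = ""  # how the evaluation was executed (attempts, time limits)
    scoring: str = ""         # the rule used to turn transcripts into scores


def undocumented(record: EvaluationRecord) -> list[str]:
    """Return the names of stages left blank, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Example: a report that omits how the evaluation was run.
example = EvaluationRecord(
    definition="multi-step protocol planning",
    design="agent with literature search tool",
    scoring="all steps correct",
)
gaps = undocumented(example)
print(gaps)
```

A check like this only flags missing documentation; it says nothing about whether the documented choices were sound.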

Load-bearing premise

The practical considerations drawn from the authors' own evaluations are generalizable enough to guide interpretation of results produced by other organizations and frontier systems.

What would settle it

The central claim would be challenged by a direct comparison of two evaluations of the same AI agent that differ in only one documented design choice, such as scoring criteria, yet reach identical conclusions about biological risk levels.
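A minimal sketch of such a comparison, assuming invented data: the same set of agent attempts is scored under two rules that differ only in whether partial progress earns credit. The records, task names, and both scoring functions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent attempt at a multi-step task (hypothetical record)."""
    task_id: str
    steps_completed: int
    steps_total: int


def strict_score(r: TaskResult) -> float:
    """Full credit only if every step was completed."""
    return 1.0 if r.steps_completed == r.steps_total else 0.0


def partial_score(r: TaskResult) -> float:
    """Credit proportional to the fraction of steps completed."""
    return r.steps_completed / r.steps_total


def mean_score(results, scorer) -> float:
    """Average a scoring rule over all attempts."""
    return sum(scorer(r) for r in results) / len(results)


# Invented results for illustration only.
RESULTS = [
    TaskResult("task-a", 4, 4),
    TaskResult("task-b", 3, 4),
    TaskResult("task-c", 2, 4),
    TaskResult("task-d", 0, 4),
]

strict = mean_score(RESULTS, strict_score)
partial = mean_score(RESULTS, partial_score)
print(strict, partial)  # 0.25 0.5625
```

On this toy data the two rules disagree by more than a factor of two. If comparisons of this kind on real evaluations showed no such divergence, that would count as evidence against the claim.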

Figures

Figures reproduced from arXiv: 2606.19899 by Alyssa Worland, Jeffrey Lee, Kyle Brady, Patricia Paskov.

Figure 1: Practical considerations for defining, designing, running, scoring, and documenting
Figure 2: A biological weapon risk chain (Brady & Lee et al.,
Figure 3: An example decomposition of “biological tool use” into discrete tasks and subtasks.
Original abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper synthesizes evidence on AI-enabled biological risks and positions biological agentic evaluations as an interpretation-sensitive tool for assessing AI scientists and agentic systems. Its central contribution is a set of practical considerations, drawn from the authors' own evaluations, on how choices in defining, designing, running, scoring, and documenting evaluations shape what results imply about risk; these are offered to help policymakers interpret outputs, guide funders, and support biosecurity practitioners.

Significance. If the considerations hold beyond the authors' specific setups, the work could usefully caution against over-interpreting evaluation results for frontier AI biological capabilities. The experience-grounded framing is a strength for highlighting under-documented design sensitivities, but the absence of comparative evidence across organizations or systems limits the strength of claims about guiding external interpretation and investment decisions.

major comments (1)
  1. [Abstract/Introduction] Abstract and introduction: the intended uses (guiding policymakers and funders on outputs from other organizations, assessing emerging systems) require the considerations to transfer beyond the authors' evaluations, yet no comparative analysis, cross-lab validation, or evidence is provided that the identified sensitivities hold for different labs, models, or frontier-scale systems; this untested transferability assumption is load-bearing for the policy contribution.
minor comments (1)
  1. The manuscript would benefit from explicit statements distinguishing claims directly supported by the authors' evaluation data from broader interpretive guidance.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We address the major comment below and outline revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract/Introduction] Abstract and introduction: the intended uses (guiding policymakers and funders on outputs from other organizations, assessing emerging systems) require the considerations to transfer beyond the authors' evaluations, yet no comparative analysis, cross-lab validation, or evidence is provided that the identified sensitivities hold for different labs, models, or frontier-scale systems; this untested transferability assumption is load-bearing for the policy contribution.

    Authors: We agree that the intended uses described in the abstract and introduction presuppose that the considerations transfer beyond our own evaluations, and that the manuscript provides no comparative analysis or cross-lab validation to support this. The considerations are explicitly drawn from our own evaluation experience, and the paper neither claims nor demonstrates that the identified sensitivities are universal across labs, models, or frontier-scale systems. To address this, we will revise the abstract and introduction to qualify the scope more precisely: the considerations are presented as experience-grounded insights intended to illustrate how design choices can affect interpretation and to encourage caution, rather than as validated general principles ready for direct application to other organizations' outputs. We will also add explicit language noting the lack of comparative evidence as a limitation and a direction for future work.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: considerations are experience-based guidance without reduction to fitted inputs or self-citation chains

Full rationale

The paper's central contribution consists of practical considerations for interpreting biological agentic evaluations, explicitly drawn from the authors' own work but presented as qualitative guidance rather than a derivation, prediction, or theorem. No equations, fitted parameters, or load-bearing self-citations appear in the provided text; the argument does not reduce any result to its inputs by construction. The manuscript is self-contained as a synthesis and advisory document whose claims rest on documented experience rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on domain assumptions about the relevance of agentic evaluations to biological risk and the transferability of lessons from the authors' evaluations; no free parameters, mathematical axioms, or invented entities are introduced.

axioms (1)
  • Domain assumption: Agentic evaluations can provide credible evidence about biological capabilities and risks when properly designed and interpreted.
    Invoked in the abstract as the basis for treating evaluations as a promising tool.

pith-pipeline@v0.9.1-grok · 5720 in / 1060 out tokens · 21044 ms · 2026-06-26T15:36:03.080569+00:00 · methodology

