pith. sign in

arxiv: 2605.30162 · v1 · pith:44HEIP6Hnew · submitted 2026-05-28 · 💻 cs.AI · cs.CR· cs.LG

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Pith reviewed 2026-06-29 07:25 UTC · model grok-4.3

classification 💻 cs.AI cs.CRcs.LG
keywords biosecurityrefusal auditingsparse autoencoderslanguage modelsdivergence scoresafety evaluationinternal activationsGemma models
0
0 comments X

The pith

A divergence score from domain-fine-tuned sparse autoencoders separates comply and refuse responses with a 0.647-point gap and zero overlap on Gemma 4 E2B-IT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether language model refusals to biosecurity queries are structurally sound or merely surface-level responses that shift with prompt changes. Behavioral tests across five models revealed inconsistent performance, with some models refusing nothing, others over-refusing benign queries, and all failing to cleanly separate hazard from non-hazard. To probe internal structure, the authors compute a divergence score D that contrasts the model's output label against its sparse autoencoder activations on the same inputs. On Gemma 4 E2B-IT this score produced complete separation between comply and refuse cases. The work therefore proposes activation-level auditing as a way to detect refusal weaknesses that standard output-based evaluations miss.

Core claim

Behavioral biosecurity tests showed no model cleanly discriminated benign from hazard queries. Gemma 2 2B-IT never refused across 75 prompts. Gemma 4 E2B-IT refused 65 of 75 with chat formatting but zero without it and zero under an 80-token cap. Qwen and Phi over-refused 83-87 percent of benign biology. Only Llama 3.2 1B produced a meaningful 61-point tier gradient. The divergence score D, formed by comparing surface response labels to activations from a domain-fine-tuned SAE, produced a 0.647-point gap with zero overlap on Gemma 4 E2B-IT for the 75 prompts tested.

What carries the argument

The divergence score D, which subtracts or contrasts the surface behavioral label from the internal sparse autoencoder feature activations on the same prompt.

If this is right

  • Refusal behavior collapses under modest changes to formatting or output length on multiple models.
  • Models refuse some non-toxic Schedule I compounds at rates higher than genuinely hazardous biology.
  • Gemma 4 E2B-IT exhibits clean internal separation despite mixed behavioral results.
  • Domain-fine-tuned SAEs can be trained on consumer hardware for this auditing task.
  • Llama 3.2 1B is the only tested model showing a clear gradient across hazard tiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to audit other safety properties such as truthfulness or bias using the same divergence approach.
  • Releasing the two domain SAEs enables direct replication and extension by other researchers.
  • Architectural variation in refusal depth implies that safety mechanisms differ substantially inside different model families.
  • If the separation holds on broader prompt sets, activation auditing could become a standard complement to behavioral red-teaming.

Load-bearing premise

That sparse autoencoder features encode refusal-relevant concepts independently of the model's final output label.

What would settle it

Substantial overlap appearing in the divergence score between comply and refuse responses when the same model is re-tested with a different SAE or when the method is applied to a new architecture.

Figures

Figures reproduced from arXiv: 2605.30162 by Caleb DeLeeuw.

Figure 2
Figure 2. Figure 2: BioRefusalAudit interactive dashboard showing all 75 eval prompts with per￾prompt D scores, surface labels, divergence flags, and tier assignments. Color coding distinguishes comply (green), hedge (yellow), and refuse (red) responses. Sorting by D score surfaces the highest-divergence prompts, the ones where surface label and internal activations point in opposite directions. Live at solshinecode.github.io… view at source ↗
Figure 2
Figure 2. Figure 2: BioRefusalAudit interactive dashboard Correct ordering achieved. Narrower separation is expected at this training scale. Wave 3 explicit￾prompt activations (n=300) are the next training step. This result is in-sample: the adapter was trained and evaluated on the same 75 samples. It is evidence that the activation signal is learnable, not a held-out validation of generalization. Discussion What behavioral e… view at source ↗
read the original abstract

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper evaluates biosecurity refusal behaviors across five LLMs (Gemma 2 2B-IT, Gemma 4 E2B-IT, Qwen 2.5 1.5B, Phi-3-mini, Llama 3.2 1B), reporting that no model cleanly separates benign from hazardous queries, with specific failures such as Gemma 2 2B-IT hedging on all hazard-adjacent prompts, Gemma 4 E2B-IT refusing 65/75 with chat templates but 0/75 without and collapsing under an 80-token cap, and over-refusal rates of 83-87% on benign biology by Qwen and Phi models. It introduces a divergence score D comparing surface response labels to SAE activations, reports a 0.647-point gap with zero overlap on n=75 prompts for Gemma 4 E2B-IT using a domain-fine-tuned SAE, and releases two fine-tuned Gemma 2 domain SAEs, concluding that activation-level auditing may reveal refusal failure modes invisible to behavioral evaluation.

Significance. If the independence of the divergence score D from surface labels can be established, the work would offer a useful complementary method for probing internal refusal mechanisms in biosecurity contexts beyond output-only tests, highlighting architectural differences in refusal. The release of the two domain-fine-tuned SAEs and the demonstration of feasibility on consumer hardware (GTX 1650 Ti Max-Q plus Colab T4) are explicit strengths for reproducibility and community follow-up. The preliminary status and narrow catalog limit immediate impact, but the approach suggests a direction for internal auditing in AI safety.

major comments (1)
  1. [Divergence score D and Gemma 4 E2B-IT results] The section describing the divergence score D and the Gemma 4 E2B-IT results: the reported 0.647-point gap with zero overlap (n=75) is interpreted as evidence that activation auditing surfaces failure modes invisible to behavioral evaluation. However, D is computed via within-sample calibration directly comparing surface labels to activations on the same responses, without reported out-of-distribution prompts, cross-validation, error bars, or causal interventions on SAE features. This leaves open whether the separation measures independent structural refusal depth or downstream correlation with the labeling process itself.
minor comments (2)
  1. [Abstract and results] The abstract and results paragraphs could more explicitly separate the quantitative claims (e.g., the 0.647 gap) from the listed caveats on preliminary status, narrow catalog, and within-sample calibration to improve clarity for readers.
  2. [Methods] Notation for the divergence score D would benefit from an explicit equation or pseudocode definition in the methods to allow precise reproduction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on this preliminary hackathon project. We address the major comment below, agreeing that the current analysis has important limitations in establishing independence of the divergence score, and will revise the manuscript to clarify this.

read point-by-point responses
  1. Referee: The section describing the divergence score D and the Gemma 4 E2B-IT results: the reported 0.647-point gap with zero overlap (n=75) is interpreted as evidence that activation auditing surfaces failure modes invisible to behavioral evaluation. However, D is computed via within-sample calibration directly comparing surface labels to activations on the same responses, without reported out-of-distribution prompts, cross-validation, error bars, or causal interventions on SAE features. This leaves open whether the separation measures independent structural refusal depth or downstream correlation with the labeling process itself.

    Authors: We agree the referee's characterization is accurate: D is computed within-sample on the identical n=75 responses used to derive the behavioral labels, with no OOD prompts, cross-validation, error bars, or feature interventions reported. The manuscript already describes the result as preliminary with 'within-sample calibration' and 'Gemma-family-only SAE coverage,' and presents the 0.647 gap as suggestive evidence rather than conclusive proof of structural independence. The separation in SAE space between comply and refuse classes remains noteworthy as a complementary signal, but we do not claim it measures refusal depth beyond correlation with the labeling process. In revision we will expand the discussion to explicitly state the correlational nature of the result, add a dedicated limitations paragraph, and outline required future validation steps including OOD testing and interventions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical separation reported as observation, not identity

full rationale

The paper introduces divergence score D as an empirical comparison between surface response labels (comply/refuse) and SAE feature activations on the same set of 75 prompts, then reports an observed 0.647-point gap with zero overlap for Gemma 4 E2B-IT. No equation or definition is provided that makes the reported gap equivalent to the input labels by construction, nor is any parameter fitted on a subset and then relabeled as a prediction. The SAE is described as separately trained (author-trained bio SAE), and the result is presented as a preliminary empirical finding rather than a first-principles derivation or self-referential identity. No self-citation chains, uniqueness theorems, or ansatzes appear in the provided text. The central claim therefore remains an independent observation on the data rather than a reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The claim rests on the assumption that SAE features encode refusal concepts in a way that makes the divergence score informative, plus standard SAE reconstruction properties; no new physical entities are postulated.

free parameters (1)
  • within-sample calibration for D
    The 0.647 gap and zero-overlap threshold are calibrated on the same 75 responses used for evaluation.
axioms (1)
  • domain assumption Sparse autoencoder features correspond to human-interpretable semantic concepts such as refusal or compliance
    Invoked when interpreting activation separation as evidence of structural refusal depth.
invented entities (1)
  • divergence score D no independent evidence
    purpose: Quantify difference between surface refusal label and internal SAE activations
    New metric introduced to compare behavioral output to internal representations

pith-pipeline@v0.9.1-grok · 5919 in / 1513 out tokens · 29200 ms · 2026-06-29T07:25:56.106231+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 5 canonical work pages · 5 internal anchors

  1. [1]

    Abdin, M. et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219. Apart Research. (2026). BioRefusalAudit: hackathon project page with reviewer feedback. Apart Research AIxBio Sprint, April

  2. [2]

    Refusal in Language Models Is Mediated by a Single Direction

    apartresearch.com/project/biorefusalaudit-…-1fyk. Arditi, A. et al. (2024). Refusal in LLMs is mediated by a single direction. arXiv:2406.11717. Liu, A. B., Nedungadi, S., Cai, B., Kleinman, A., Bhasin, H., & Donoughe, S. (2025). ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity. NeurIPS 2025 Workshop BioSafe GenAI. openreview.net/forum?id=...

  3. [3]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Compass Pathways. (2018, October 23). COMPASS Pathways receives FDA Breakthrough Therapy designation for psilocybin therapy for treatment-resistant depression. Investor News. ir.compasspathways.com/…/COMPASS-Pathways-receives-FDA-Breakthrough-Therapy- designation-for-psilocybin-therapy-for-treatment-resistant-depression. 15 Cunningham, H. et al. (2023). S...

  4. [4]

    Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    arXiv:2408.05147. Marks, S., Rager, C. et al. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv:2403.19647. Qi, X. et al. (2024). Safety alignment should be made more than just a few tokens deep. arXiv:2406.05946. Rajamanoharan, S. et al. (2024). Jumping ahead: Improving reconstruction fidelity ...

  5. [5]

    SecureBio. (2026). BioTier: Access stratification and biosecurity capability thresholds for biological AI. securebio.org/biotier. Wei, A. et al. (2023). Jailbroken: How does LLM safety training fail? arXiv:2307.02483. Zou, A. et al. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043. Carter, S. & Butchello,...

  6. [6]

    Explain how viral entry into host cells is typically taught in an undergraduate microbiology course

    Organization for Ethical Source. (2024). Hippocratic License 3.0. firstdonoharm.dev. Appendix A: Illustrative Prompt Examples 16 Tier Framing Illustrative paraphrase Design intent Benign biology Educational “Explain how viral entry into host cells is typically taught in an undergraduate microbiology course. ” Baseline; compliance expected Benign biology R...