Mechanistic Interpretability of Antibody Language Models Using SAEs

Anisha Parsan; Anna L. Beukenhorst; Charlotte M. Deane; John J. Yang; Nithin Parsan; Oliver M. Turnbull; Rebonto Haque

arxiv: 2512.05794 · v3 · pith:VJLPHIXBnew · submitted 2025-12-05 · 💻 cs.LG · cs.AI· q-bio.QM

Mechanistic Interpretability of Antibody Language Models Using SAEs

Rebonto Haque , Oliver M. Turnbull , Anisha Parsan , Nithin Parsan , John J. Yang , Anna L. Beukenhorst , Charlotte M. Deane This is my paper

Pith reviewed 2026-05-17 00:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AIq-bio.QM

keywords sparse autoencodersmechanistic interpretabilityantibody language modelsprotein language modelsfeature steeringhierarchical SAEslatent featuresgenerative control

0 comments

The pith

Ordered SAEs reliably identify steerable features in antibody language models at the cost of more complex activation patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how sparse autoencoders can be used to understand and control antibody language models. TopK SAEs successfully map latent features to biological concepts but do not always allow direct control over the model's generated sequences. Ordered SAEs add a hierarchical structure to the features, which makes it possible to steer generation more reliably. However, this comes with activation patterns that are harder to interpret. The findings help choose the right tool for different interpretability goals in protein models.

Core claim

TopK SAEs can reveal biologically meaningful latent features in autoregressive antibody language models, but high feature-concept correlation does not guarantee causal control over generation. Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, though this results in more complex and less interpretable activation patterns.

What carries the argument

TopK and Ordered sparse autoencoders (SAEs) applied to extract and steer latent features in antibody language models.

If this is right

TopK SAEs suffice for mapping latent features to biological concepts.
Ordered SAEs are better when precise generative steering is needed.
Mechanistic interpretability advances for domain-specific protein language models.
SAE choice depends on whether the goal is concept mapping or output control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the hierarchical structure in Ordered SAEs generalizes, it could help interpret other specialized language models beyond antibodies.
Combining TopK for initial mapping with Ordered for steering might offer a hybrid approach to interpretability.
Testing these methods on larger antibody datasets or different protein families would check broader applicability.

Load-bearing premise

Observed correlations between SAE features and biological concepts reflect causal mechanisms rather than spurious associations, and steering success generalizes to untested antibody sequences.

What would settle it

A test showing that steering using Ordered SAE features fails to alter generated antibodies in the expected biological way on a held-out set of sequences, or that TopK features do enable such control.

Figures

Figures reproduced from arXiv: 2512.05794 by Anisha Parsan, Anna L. Beukenhorst, Charlotte M. Deane, John J. Yang, Nithin Parsan, Oliver M. Turnbull, Rebonto Haque.

**Figure 1.** Figure 1: Latent activations (a) and neuron activations (b) for CDRH3 identity, and latent activations for IGHJ3 (c). The x-axis shows the amino-acid sequence of the VH region of a test antibody; the y-axis shows normalised activation. CDRs are coloured CDRH1 (red), CDRH2 (blue), and CDRH3 (green). Latent activations localise to the expected regions—CDRH3 in (a) and the heavy J region in (c)—whereas neuron activatio… view at source ↗

**Figure 2.** Figure 2: Comparison of absolute positional (a) and IMGT (b) activations of top three IGHJ4 latents. The sequence/IMGT positions are shown on the x-axis. For the sequence positions, the amino acid sequences were end-padded to a constant length of 350. Percentage of total activations on any given position across validation IGHJ4 sequences is shown on the y-axis. The most frequent IMGT position for activation is highl… view at source ↗

**Figure 3.** Figure 3: Results of IGHJ4 feature steering for latent 463 (a), 4720 (b), 6276 (c). Y-axis shows the proportion of generated sequences. Plots are coloured by heavy J gene identity. X-axis shows the steering factor used (alpha). Results are for a library of 1000 pIgGen-generated sequences. For each latent tested (a-c), steering did not result in a predictable change in library composition. ranked features with an F-… view at source ↗

**Figure 4.** Figure 4: Results of IGHJ4 steering using Ordered latent 12 (a) and 49 (b). Y-axis shows the proportion of generated sequences. Plots are coloured by heavy J gene identity. X-axis shows the steering factor used (alpha). Results are for a library of 1000 p-IgGengenerated sequences. Latent 12—positively correlated with IGHJ4—increases IGHJ4 proportion under positive steering, whereas latent 49—negatively correlated—d… view at source ↗

**Figure 5.** Figure 5: IMGT activations of latent 12 (a) and 49 (b). Activation patterns of both latents show scattered distribution across the range of IMGT positions. 10 [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models, and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Ordered SAEs beat TopK for steering antibody models but the win might be an artifact of the ordering rather than a discovery of causal structure.

read the letter

The punchline is straightforward: in antibody language models, Ordered SAEs deliver more reliable steering of generations than TopK SAEs, but they come with less interpretable and more complex activation patterns. The paper applies these SAE methods to autoregressive antibody models and compares their performance on two fronts. First, mapping latent features to biological concepts. Second, using those features to steer the model's output. TopK SAEs do a decent job identifying meaningful features tied to things like specific amino acid patterns or functional properties. However, intervening on them does not always produce the expected changes in generated antibodies. Ordered SAEs, which enforce a hierarchical ordering on the features, make the steering more consistent and effective. This comparison is the useful part. It gives concrete guidance on matching the SAE type to the goal—concept discovery versus actual control. For applications in AI-assisted antibody design, where precise modifications to sequences matter, this distinction could help practitioners choose the right tool. The soft spots are around the causal interpretation. The stress test raises a fair point: the ordering constraint might simply bias the features toward ones that are easier to steer, rather than uncovering directions that are inherently causal in the original model. An experiment that compares to a non-ordered version with similar sparsity would clarify whether the hierarchy is revealing something new or just imposing structure. Also, the lack of reported dataset sizes, statistical significance, or effect magnitudes in the summary makes it difficult to judge how general these findings are beyond the tested cases. Overall, this work targets people already working at the intersection of mechanistic interpretability and biological sequence models. A reader interested in improving control over generative models for medicine would get value from the empirical observations. The argument holds together without obvious internal contradictions. I would recommend putting it through peer review. The topic is relevant and the basic comparison is worth exploring further with tighter experiments.

Referee Report

3 major / 2 minor

Summary. The manuscript applies TopK and Ordered Sparse Autoencoders (SAEs) to autoregressive antibody language models to extract latent features and enable generative steering. It reports that TopK SAEs identify biologically meaningful features but that high feature-concept correlation does not guarantee causal control over generation. Ordered SAEs impose a hierarchical structure that reliably surfaces steerable features, at the cost of more complex and less interpretable activation patterns. The work positions these distinctions as guidance for choosing SAE variants in domain-specific protein language models depending on whether the goal is concept mapping or precise steering.

Significance. If the empirical distinctions are substantiated, the paper would provide useful practical guidance on SAE selection for mechanistic interpretability in biological sequence models. It extends SAE techniques to antibody LMs and usefully separates correlation from causal control, which is relevant as these models see increasing use in design tasks. The trade-off between interpretability and steerability is a concrete contribution that could inform tool choice in the field.

major comments (3)

[Abstract and §4] Abstract and §4 (Steering Experiments): The central claim that Ordered SAEs 'reliably identify steerable features' via imposed hierarchy lacks an ablation that removes the ordering constraint while holding sparsity and reconstruction loss fixed. Without this control it remains unclear whether the reported steering advantage reflects discovered model-internal causal directions or is an artifact of the ordering itself biasing controllable axes.
[§3 and Results tables] §3 (Methods) and Results tables: No dataset sizes, number of antibody sequences, exact correlation or steering-success metrics, or statistical tests are described. This absence makes it impossible to assess whether the claimed distinctions between TopK and Ordered SAEs are robust or reproducible.
[§5] §5 (Discussion): The assertion that high feature-concept correlation does not guarantee causal control for TopK SAEs is load-bearing for the recommendation to prefer Ordered SAEs for steering; the manuscript must show concrete counter-examples where high correlation fails to produce steering success, with quantitative effect sizes.

minor comments (2)

[§2] §2 (Background): The precise definition and implementation details of the ordering constraint in Ordered SAEs versus standard TopK should be expanded for reproducibility.
[Figures] Figure captions: Activation-pattern visualizations should explicitly label which panels correspond to TopK versus Ordered SAEs and quantify the claimed increase in complexity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the clarity and rigor of our empirical claims. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Steering Experiments): The central claim that Ordered SAEs 'reliably identify steerable features' via imposed hierarchy lacks an ablation that removes the ordering constraint while holding sparsity and reconstruction loss fixed. Without this control it remains unclear whether the reported steering advantage reflects discovered model-internal causal directions or is an artifact of the ordering itself biasing controllable axes.

Authors: We agree that a controlled ablation isolating the ordering constraint while holding sparsity and reconstruction loss fixed would strengthen the evidence. The Ordered SAE is defined by its hierarchical ordering mechanism, which is the key distinction from TopK SAEs in our comparisons. In the revised manuscript, we will add a dedicated discussion of this architectural difference and include a sensitivity analysis or partial ablation where feasible to better isolate the contribution of the ordering. revision: partial
Referee: [§3 and Results tables] §3 (Methods) and Results tables: No dataset sizes, number of antibody sequences, exact correlation or steering-success metrics, or statistical tests are described. This absence makes it impossible to assess whether the claimed distinctions between TopK and Ordered SAEs are robust or reproducible.

Authors: We apologize for the omission of these critical details. In the revised manuscript, we will expand §3 and the results tables to report the exact number of antibody sequences used, dataset sizes for SAE training and evaluation, numerical values for all correlation and steering-success metrics, and appropriate statistical tests (including p-values) to demonstrate the robustness of the observed differences. revision: yes
Referee: [§5] §5 (Discussion): The assertion that high feature-concept correlation does not guarantee causal control for TopK SAEs is load-bearing for the recommendation to prefer Ordered SAEs for steering; the manuscript must show concrete counter-examples where high correlation fails to produce steering success, with quantitative effect sizes.

Authors: We will strengthen this section by adding explicit counter-examples drawn from our steering experiments. These will include specific TopK features with high concept correlations that nonetheless showed low steering success, accompanied by quantitative effect sizes such as changes in generation probabilities or success rates, to directly support the distinction from Ordered SAEs. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; claims rest on direct experimental comparisons of SAE variants.

full rationale

The paper applies TopK and Ordered SAEs to antibody language models and reports empirical outcomes on feature-concept correlations and steering success. No derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. The central claims compare observed performance differences between SAE architectures on biological data without reducing to self-definitional or ansatz-smuggled constructions. Any self-citations are peripheral and not required for the reported findings, which remain falsifiable via external replication on held-out sequences.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.0 · 5457 in / 1027 out tokens · 39114 ms · 2026-05-17T00:23:30.871293+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Ordered SAEs impose a hierarchical structure that reliably identifies steerable features... per-index nested grouping and strictly decreasing truncation weights
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

high feature–concept correlation does not guarantee causal control over generation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.