PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding

Aristofanis Rontogiannis; Axel Elaldi; Giosue Migliorini; Grigori Guitchounts; Nicholas Franklin; Olivia Viessmann

arxiv: 2606.27440 · v1 · pith:7LLKNOCOnew · submitted 2026-06-25 · 💻 cs.LG

Giosue Migliorini, Aristofanis Rontogiannis, Grigori Guitchounts, Nicholas Franklin, Axel Elaldi, Olivia Viessmann

Pith reviewed 2026-06-29 01:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords PairSAE · sparse autoencoders · pair representations · mechanistic interpretability · protein co-folding · Boltz-2 · UniProt annotations · affinity prediction


The pith

PairSAE produces interpretable features from pair representations in protein co-folding models by summarizing tensors with N-mode SVD before sparse autoencoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PairSAE to solve the problem that standard sparse autoencoders fail on pairwise tensors in pairformer architectures, producing quadratic feature counts and missing concepts spread across sequence and pair data. It summarizes each pairwise tensor via N-mode SVD to obtain token-wise interaction roles, then trains one sparse autoencoder whose features decode back into both sequence embeddings and pair representations. Evaluated on Boltz-2 activations from PLINDER complexes, the resulting features align with UniProt annotations and predict affinity values, showing that the latent space can be mapped to concrete structural concepts.

Core claim

PairSAE summarizes pairwise tensors via an N-mode SVD into token-wise interaction roles, then uses a sparse autoencoder to learn a shared set of token-level features that decode into both sequence and pair representations. Evaluated on Boltz-2 activations for PLINDER protein-ligand complexes, PairSAE yields interpretable features that align with UniProt annotations and predict Boltz-2 affinity values.

What carries the argument

N-mode SVD summarization of pairwise tensors into token-wise interaction roles, followed by a shared sparse autoencoder that reconstructs both sequence and pair data.
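The summarization step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: it follows the excerpt quoted in the reference list (mode-1 and mode-2 unfoldings, `numpy.linalg.svd`, truncation to r = 64 columns, zero-padding when there are fewer tokens than r). The function name `token_roles`, the choice to concatenate the two sets of left singular vectors per token, and the omission of any singular-value scaling are assumptions made for illustration.

```python
import numpy as np

def token_roles(Z, r=64):
    """Summarize a pair tensor Z of shape (N, N, d) into per-token role vectors.

    Assumption: roles are the leading left singular vectors of the mode-1 and
    mode-2 unfoldings, concatenated per token, with no singular-value scaling.
    """
    n_tok = Z.shape[0]
    # Mode-1 unfolding: rows index the first token axis, shape (N, N*d).
    Z1 = Z.reshape(n_tok, -1)
    # Mode-2 unfolding: rows index the second token axis, shape (N, N*d).
    Z2 = np.moveaxis(Z, 1, 0).reshape(n_tok, -1)
    roles = []
    for unfolding in (Z1, Z2):
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        U = U[:, :r]  # keep at most the first r columns
        if U.shape[1] < r:
            # Fewer tokens than r: fill the remaining entries with zeros.
            U = np.pad(U, ((0, 0), (0, r - U.shape[1])))
        roles.append(U)
    return np.concatenate(roles, axis=1)  # shape (N, 2*r)

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 10, 8))  # toy pair tensor: 10 tokens, 8 channels
m = token_roles(Z, r=64)
print(m.shape)
```

Each token ends up with a fixed-width vector regardless of complex size, which is what lets a single token-level SAE be trained instead of one operating on all N² pair entries.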

If this is right

Features align with existing UniProt annotations on protein-ligand complexes.
The same features predict numerical affinity values from the model.
The method avoids the quadratic feature blow-up and loss of joint concepts that occur with naive application of SAEs to pair tensors.
It supplies a route from foundation-model activations to human-readable structural biology concepts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SVD-plus-shared-SAE pattern could be tested on other pair-based architectures outside structural biology, such as graph transformers or vision-language models.
The recovered features might be used to locate and correct specific failure modes in the original model on particular interaction types.
If the features prove stable across different training runs of the base model, they could serve as a diagnostic for whether the foundation model has internalized particular biological rules.

Load-bearing premise

That reducing pairwise tensors to token-wise roles via N-mode SVD still lets a sparse autoencoder recover distributed concepts that are jointly encoded across sequence and pair representations.

What would settle it

If the extracted features show no statistical alignment with UniProt annotations or no above-chance correlation with Boltz-2 affinity values on held-out complexes, the claim that PairSAE links the latent space to interpretable structural concepts would not hold.

Figures

Figures reproduced from arXiv: 2606.27440 by Aristofanis Rontogiannis, Axel Elaldi, Giosue Migliorini, Grigori Guitchounts, Nicholas Franklin, Olivia Viessmann.

**Figure 2.** Count of concepts with F1 ≥ 0.5 using complex-level recall (left) and token-level recall (right), grouped by UniProt annotation category (test-set counts in parentheses).

… use three nested widths $c_1 < c_2 < c_3 = D$), and summing them to compute the objectives $L(s_i) := \sum_{k=1}^{3}$ …
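To make the F1 ≥ 0.5 criterion in the Figure 2 caption concrete, here is a hedged sketch of a token-level F1 between a thresholded feature and a binary annotation mask. The helper names `f1_score` and `best_f1` and the threshold grid are hypothetical, and the complex-level recall variant mentioned in the caption is not modelled here.

```python
import numpy as np

def f1_score(active, labeled):
    """F1 between a boolean 'feature fires' mask and a boolean annotation mask."""
    active = np.asarray(active, dtype=bool)
    labeled = np.asarray(labeled, dtype=bool)
    tp = np.sum(active & labeled)
    if tp == 0:
        return 0.0
    precision = tp / active.sum()
    recall = tp / labeled.sum()
    return float(2 * precision * recall / (precision + recall))

def best_f1(activations, labeled, thresholds=(0.0, 0.15, 0.5, 0.8)):
    """Best F1 over a small grid of activation thresholds (hypothetical grid)."""
    return max(f1_score(activations > t, labeled) for t in thresholds)

# Toy example: one feature's activations over six tokens and one annotation.
acts = np.array([0.9, 0.7, 0.0, 0.2, 0.0, 0.6])
labels = np.array([1, 1, 0, 0, 0, 1])
score = best_f1(acts, labels)
counts_as_aligned = score >= 0.5  # the cutoff named in the caption
```

Counting, per annotation category, how many concepts have at least one feature clearing the cutoff would yield a tally of the kind the caption describes.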

**Figure 3.** Left: negative affinity values from Boltz-2 (higher means stronger affinity) [Passaro et al., 2025], and predicted values from the LASSO regression in (8). Center: feature 2299 from R3-L64, showing a strong group difference in affinity values as measured by a Welch t-test. Right: values of feature 2299 overlaid on a ligand where it is highly activated.

… display a count of how many features we could predict with a sco…
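The group comparison named in the Figure 3 caption can be illustrated with a hand-rolled Welch t-statistic. This is a sketch under assumptions: the split of complexes into "feature active" and "feature inactive" groups and the toy numbers are hypothetical, and the paper's exact grouping rule is not given in the text available here.

```python
import numpy as np

def welch_t(a, b):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error of group a
    vb = b.var(ddof=1) / b.size  # squared standard error of group b
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return float(t), float(df)

# Toy affinity values for complexes where a feature is inactive vs. active.
inactive = [1.0, 2.0, 3.0, 4.0]
active = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(inactive, active)
```

Unlike Student's t-test, this form does not assume the two groups share a variance, which matters when a feature fires on only a small, possibly atypical subset of complexes.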

Abstract

Foundation models for structural biology have achieved remarkable performance in predicting biomolecular structure and show promise for the design of proteins and small molecules. Yet understanding which internal features drive their outputs remains challenging. Standard sparse autoencoders (SAEs), effective on transformer-style sequence embeddings, do not transfer cleanly to pairformer-like architectures: naively operating on pairwise representations yields a quadratic blow-up of features and obscures concepts distributed jointly across sequence and pair representations. We introduce PairSAE, which summarizes pairwise tensors via an N-mode SVD into token-wise interaction roles, then uses a sparse autoencoder to learn a shared set of token-level features that decode into both sequence and pair representations. Evaluated on Boltz-2 activations for PLINDER protein-ligand complexes, PairSAE yields interpretable features that align with UniProt annotations and predict Boltz-2 affinity values. These results indicate that PairSAE links the latent space of foundation models for structural biology to interpretable structural concepts, clarifying what the model "knows" while avoiding pairformer-induced pitfalls that limit conventional SAEs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PairSAE gives a workable route around the quadratic feature blow-up in pairformer SAEs, but the N-mode SVD step looks like it could drop non-separable joint patterns.


PairSAE tries to make sparse autoencoders work on the pairwise tensors that come out of models like Boltz-2. The main move is to run an N-mode SVD on the L by L by d tensor so each token gets a compact interaction role vector, then train one SAE on those vectors that can reconstruct both the original sequence embeddings and the pair information.

That construction is new enough in this domain. Standard SAEs on sequence-only embeddings do not scale to pair representations without either massive feature counts or loss of concepts that live across the pair. The paper shows the method on PLINDER complexes, pulls out features that line up with UniProt labels, and uses them to predict affinity scores. Those are concrete outputs worth checking.

The soft spot is the SVD step itself. Reducing the full pairwise tensor to per-token summaries assumes most of the signal factors across tokens. Cooperative or higher-order patterns that do not separate that way would be truncated before the SAE sees them. The abstract gives no reconstruction error numbers after the SVD, no ablation on the rank chosen, and no direct comparison of pair-decoding fidelity with and without the reduction. If those checks are missing from the full paper too, the claim that PairSAE avoids pairformer pitfalls rests on an untested preservation assumption.
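One way to probe this concern is to measure how much of a pair tensor survives projection onto the leading mode-1 and mode-2 subspaces as the rank varies. The sketch below is illustrative only: `truncation_error` is a hypothetical helper, the random tensor stands in for real activations, and the orthogonal projection used here is an assumption about what "preserved by the SVD step" should mean, not the paper's decoding path.

```python
import numpy as np

def leading_basis(Z, mode, r):
    """First r left singular vectors of the mode-`mode` unfolding of Z."""
    M = np.moveaxis(Z, mode, 0).reshape(Z.shape[mode], -1)
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r]

def truncation_error(Z, r):
    """Relative Frobenius error after projecting both token axes to rank r."""
    U1 = leading_basis(Z, 0, r)
    U2 = leading_basis(Z, 1, r)
    P1 = U1 @ U1.T  # projector on the first token axis
    P2 = U2 @ U2.T  # projector on the second token axis
    Z_hat = np.einsum("ia,jb,abc->ijc", P1, P2, Z)
    return float(np.linalg.norm(Z - Z_hat) / np.linalg.norm(Z))

rng = np.random.default_rng(0)
Z = rng.normal(size=(12, 12, 4))  # toy stand-in for a pair tensor
errors = {r: truncation_error(Z, r) for r in (2, 4, 8, 12)}
for r, err in errors.items():
    print(f"rank {r:2d}: relative error {err:.3f}")
```

A curve like this, computed on real activations, would show how quickly the error falls with rank and therefore how much non-separable structure is being discarded at the chosen r.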

The biological alignment results are the part that matters most for readers in structural biology interpretability. People already running SAEs on protein models will want to see whether the extra machinery actually recovers distributed concepts or just re-labels what sequence-only SAEs already find. The work is narrow but timely, so it belongs in review rather than a desk reject. A referee can ask for the missing reconstruction metrics and controls on the SVD without needing to rewrite the core idea.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PairSAE to enable sparse autoencoder-based interpretability on pair representations from pairformer-style models in structural biology. Pairwise tensors are reduced via N-mode SVD to token-wise interaction roles; an SAE then learns a shared dictionary of token-level features that are decoded back into both sequence and pair representations. On Boltz-2 activations from PLINDER protein-ligand complexes, the resulting features are reported to align with UniProt annotations and to predict affinity values, thereby linking model latents to interpretable structural concepts while sidestepping quadratic feature blow-up.

Significance. If the SVD reduction demonstrably preserves distributed joint sequence-pair concepts and the reported alignments hold under quantitative controls, the work would supply a practical route for mechanistic interpretability in co-folding foundation models, a domain where standard SAEs have been limited by representation geometry.

major comments (2)

[Abstract] Abstract and implied Methods: the central claim that PairSAE 'avoids pairformer-induced pitfalls' and recovers features that 'decode into both sequence and pair representations' without loss of distributed concepts rests on the untested assumption that N-mode SVD truncation to token-wise roles preserves non-factorizable pairwise patterns (e.g., cooperative motifs). No reconstruction fidelity metrics, ablation on SVD rank, or comparison of pair-reconstruction error before versus after the reduction are supplied to substantiate this preservation.
[Abstract] Abstract: the evaluation claims that features 'align with UniProt annotations and predict Boltz-2 affinity values' but supplies neither dataset sizes, train/test splits, quantitative metrics (R², AUROC, error bars), nor controls for spurious correlation; without these the load-bearing assertion that the features are mechanistically meaningful cannot be assessed.

minor comments (1)

[Abstract] Acronyms Boltz-2 and PLINDER are used without expansion on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify gaps in quantitative validation of the N-mode SVD step and in the reporting of evaluation details. We will revise the manuscript to incorporate the requested analyses and metrics. Point-by-point responses are below.


Referee: [Abstract] Abstract and implied Methods: the central claim that PairSAE 'avoids pairformer-induced pitfalls' and recovers features that 'decode into both sequence and pair representations' without loss of distributed concepts rests on the untested assumption that N-mode SVD truncation to token-wise roles preserves non-factorizable pairwise patterns (e.g., cooperative motifs). No reconstruction fidelity metrics, ablation on SVD rank, or comparison of pair-reconstruction error before versus after the reduction are supplied to substantiate this preservation.

Authors: We agree that explicit validation of the SVD reduction is required to support the claim that distributed pairwise concepts are preserved. The N-mode SVD is motivated as a low-rank factorization that isolates token-wise interaction roles while retaining the dominant joint sequence-pair structure, but this is an assumption that needs empirical backing. In the revised manuscript we will add (i) pair-tensor reconstruction error (Frobenius norm) before versus after truncation, (ii) an ablation over SVD rank showing the trade-off between compression and fidelity, and (iii) qualitative examples of preserved cooperative motifs. These additions will directly address the concern. revision: yes
Referee: [Abstract] Abstract: the evaluation claims that features 'align with UniProt annotations and predict Boltz-2 affinity values' but supplies neither dataset sizes, train/test splits, quantitative metrics (R², AUROC, error bars), nor controls for spurious correlation; without these the load-bearing assertion that the features are mechanistically meaningful cannot be assessed.

Authors: We acknowledge that the current abstract and main text omit the requested quantitative details. The evaluations were performed on the PLINDER protein-ligand complexes using Boltz-2 activations, but the manuscript does not report dataset cardinality, splits, or statistical controls. In revision we will explicitly state the number of complexes, the train/validation/test partitioning, the precise metrics (R² for affinity regression, AUROC for UniProt annotation alignment) with error bars across seeds, and negative controls (random features and label-shuffled baselines) to demonstrate that alignments exceed spurious correlation. These changes will make the mechanistic claims assessable. revision: yes
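The label-shuffled baseline mentioned in the second response can be sketched as a permutation test. Everything here is an assumption for illustration: the helper `shuffle_control`, the mean-difference score, and the synthetic data are hypothetical and not taken from the paper.

```python
import numpy as np

def shuffle_control(activations, labels, n_shuffles=1000, seed=0):
    """Compare an observed feature-label score against a label-shuffled null."""
    rng = np.random.default_rng(seed)
    x = np.asarray(activations, dtype=float)
    y = np.asarray(labels, dtype=bool)

    def score(mask):
        # Mean activation on annotated tokens minus mean on the rest.
        return x[mask].mean() - x[~mask].mean()

    observed = score(y)
    null = np.array([score(rng.permutation(y)) for _ in range(n_shuffles)])
    # Add-one correction keeps the empirical p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_shuffles)
    return float(observed), float(p_value)

# Synthetic example: 200 tokens, 40 annotated, feature shifted on annotated ones.
rng = np.random.default_rng(1)
labels = np.arange(200) < 40
acts = rng.normal(size=200) + 2.0 * labels
observed, p = shuffle_control(acts, labels)
print(f"observed gap {observed:.2f}, empirical p {p:.4f}")
```

Shuffling preserves the number of annotated tokens, so the null distribution reflects what alignment would look like if the feature carried no information about the annotation.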

Circularity Check

0 steps flagged

No circularity: method proposal contains no derivations or self-referential reductions


The provided abstract and description introduce PairSAE as a procedural pipeline (N-mode SVD summarization followed by SAE training on token-wise roles) without any equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems. No load-bearing step reduces a claimed result to its own inputs by construction; the work is a self-contained methodological proposal whose validity rests on empirical evaluation rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5744 in / 1102 out tokens · 32422 ms · 2026-06-29T01:21:06.687334+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 5 canonical work pages · 2 internal anchors

[1] Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Noah Getz, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Liam Atkinson, Tally Portnoi, Itamar Chinn, et al. Boltz-1 democratizing biomolecular interaction modeling. bioRxiv, pages 2024–11, 2024.

[2] Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, et al. Boltz-2: Towards accurate and efficient binding affinity prediction. bioRxiv, pages 2025–06, 2025.

[3] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652.

[4] Adam Scherlis, Kshitij Sachan, Adam S Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892.

[5] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.

[6] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. ... Transformer Circuits Thread, 2023.

[7] Elana Simon and James Zou. InterPLM: Discovering interpretable features in protein language models via sparse autoencoders. bioRxiv, pages 2024–11, 2024.

[8] URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. Edith Natalia Villegas Garcia and Alessio Ansuini. Interpreting and steering protein language models through sparse autoencoders. arXiv preprint arXiv:2502.09135, 2024.

[9] Nithin Parsan, David J Yang, and John Jingxuan Yang. Towards interpretable protein structure prediction with sparse autoencoders. In Learning Meaningful Representations of Life (LMRL) Workshop at ICLR 2025, 2025.

[10] Garyk Brixi, Matthew G Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A Gonzalez, Samuel H King, David B Li, Aditi T Merchant, et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv, pages 2025–02, 2025.

[11] Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders. arXiv preprint arXiv:2412.06410.

[12] Janani Durairaj, Yusuf Adeshina, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, Xavier Robin, Gabriel Studer, Daniel Kovtun, et al. PLINDER: The protein-ligand interactions dataset and evaluation resource. bioRxiv, pages 2024–07, 2024.

[13] UniProt. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Research, 53(D1):D609–D617, 2025.

[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] A. N-mode Singular Value Decomposition. The mode-$k$ unfolding of a tensor $T \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_N}$ is obtained by flattening all but the $k$th dimension into a matrix $T_{(k)} \in \mathbb{R}^{n_k \times \prod_{i \neq k} n_i}$. Following De Lathauwer et al. [2000a], every tensor admits the higher-order singular value decomposition $T = C \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}$, where $C \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_N}$ is a core te... 2002

[16] We compute the N-mode SVD by simply flattening $Z$ to its mode-1 and mode-2 unfolding, and obtain the SVD of these matrices using numpy.linalg.svd. We truncate to the first $r = 64$ columns, and if $N_{\text{tok}} < r$ we fill the remaining entries with zeroes. After concatenating sequence embeddings $s$ and the SVD-derived embedding $m$ into a 512-dimensional vector, we perf... 2014
