SAE-RNA: A Sparse Autoencoder Model for Interpreting RNA Language Model Representations

Sangdae Nam; Taehan Kim

arxiv: 2510.02734 · v2 · pith:7MDOUZ53new · submitted 2025-10-03 · 🧬 q-bio.BM · cs.AI · q-bio.GN

Pith reviewed 2026-05-21 21:19 UTC · model grok-4.3

classification 🧬 q-bio.BM · cs.AI · q-bio.GN

keywords sparse autoencoder · RNA language model · model interpretability · RiNALMo · biological feature mapping · representation analysis · RNA family · structural context

The pith

Sparse autoencoders decompose RNA language model representations into features that align with known biological concepts such as RNA families and structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAE-RNA, which applies a sparse autoencoder to the internal representations produced by the RNA language model RiNALMo. It tests whether the resulting sparse features can be linked to recognizable biological properties like RNA family membership or structural context. The work presents this mapping as a probe that reveals how the model organizes biological information internally rather than as a claim of new biological discovery. A reader would care because the method offers a concrete way to inspect and compare the internal organization of these black-box models used for biomolecular sequences.

Core claim

SAE-based analysis serves as a representation-level probe for characterizing how RNA language models organize biological information internally, mapping sparse features to known human-level biological features such as RNA family identity or structural context. SAE-RNA provides a feature-level framework for comparing RNA groups and identifying sparse representation components associated with these properties.

What carries the argument

Sparse autoencoder decomposition of RiNALMo representations that produces sparse features aligned with external biological annotations.
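
A minimal sketch of such a decomposition, assuming the overcomplete formulation quoted from the paper further down this page (dense embeddings x in R^d, sparse features f in R^k, loss ∥x − x̂∥₂² + λ∥f∥₁). The ReLU encoder, the bias terms, the toy dimensions, the random weights, and the names `sae_forward` and `sae_loss` are illustrative assumptions rather than details confirmed by the paper; random vectors stand in for RiNALMo embeddings.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x (n, d) into non-negative features f (n, k), then reconstruct x_hat (n, d)."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU encoder (assumed, not from the paper)
    x_hat = f @ W_dec + b_dec               # linear decoder
    return f, x_hat

def sae_loss(x, f, x_hat, lam):
    """Batch mean of ||x - x_hat||_2^2 + lam * ||f||_1."""
    recon = np.sum((x - x_hat) ** 2, axis=1)
    sparsity = np.sum(np.abs(f), axis=1)
    return float(np.mean(recon + lam * sparsity))

rng = np.random.default_rng(0)
n, d, k = 4, 8, 32                  # toy sizes; k > d makes the dictionary overcomplete
x = rng.normal(size=(n, d))         # stand-in for RiNALMo embeddings
W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))
b_enc = np.zeros(k)
b_dec = np.zeros(d)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, lam=1e-3)
```

In this sketch λ (`lam`) trades reconstruction quality against sparsity; training would minimize `loss` over the weights, a step omitted here.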

If this is right

RNA language models can be analyzed at the representation level by decomposing their activations with sparse autoencoders.
Specific sparse features recovered in this way correspond to RNA family identity and structural context.
The resulting features supply a systematic way to compare how different groups of RNA sequences are represented inside the model.
The same approach extends representation-level interpretability techniques already explored for protein language models to the RNA setting.
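
One way such a group comparison could look, as a hedged sketch: rank sparse features by the difference in mean activation between two RNA groups. The feature matrix `F`, the group masks, and the function name `top_differential_features` are hypothetical; the paper's actual comparison procedure is not described on this page.

```python
import numpy as np

def top_differential_features(F, group_a, group_b, top_n=3):
    """Rank features by the difference in mean activation between two groups.

    F is an (n_sequences, n_features) array of SAE activations;
    group_a and group_b are boolean masks over sequences.
    """
    diff = F[group_a].mean(axis=0) - F[group_b].mean(axis=0)
    order = np.argsort(-np.abs(diff))[:top_n]   # largest absolute gap first
    return [(int(j), float(diff[j])) for j in order]

# Toy activations: feature 0 fires for group A, feature 1 for group B.
F = np.array([
    [1.0, 0.0, 0.2],
    [0.8, 0.0, 0.1],
    [0.0, 0.5, 0.2],
    [0.1, 0.7, 0.1],
])
group_a = np.array([True, True, False, False])
group_b = ~group_a
ranked = top_differential_features(F, group_a, group_b, top_n=2)
```

A positive difference marks a feature more active in group A, a negative one a feature more active in group B.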

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The probe could be run on other RNA language models to test whether the recovered features remain consistent across architectures.
If the features prove stable, they might be used to diagnose why a model succeeds or fails on particular RNA sequences.
The method suggests a route for adding human-readable handles to otherwise opaque sequence models in RNA therapeutics or structure design.

Load-bearing premise

The sparse features recovered by the autoencoder correspond to meaningful biological concepts rather than training artifacts or spurious correlations in the RiNALMo representations.

What would settle it

Finding that the extracted sparse features show no better-than-random alignment with independent RNA family or structure labels on held-out sequences would show the mapping does not hold.
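
A shuffled-label control of this kind can be sketched as a permutation test. The test statistic (gap in mean activation), the toy data, and the function names below are assumptions chosen for illustration; the paper does not specify its procedure on this page.

```python
import random

def mean_gap(activations, labels):
    """Mean activation inside the labeled group minus the mean outside it."""
    inside = [a for a, y in zip(activations, labels) if y]
    outside = [a for a, y in zip(activations, labels) if not y]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

def permutation_p_value(activations, labels, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled labels give a gap >= the observed gap."""
    rng = random.Random(seed)
    observed = mean_gap(activations, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)               # breaks any feature-label link
        if mean_gap(activations, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)        # add-one correction keeps p above zero

# Toy held-out data: one feature's activations and a binary family label.
acts = [0.9, 0.8, 0.85, 0.95, 0.7, 0.1, 0.0, 0.2, 0.05, 0.15, 0.1, 0.0]
in_family = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p = permutation_p_value(acts, in_family)
```

A small `p` indicates the feature tracks the label better than shuffled labels do; a `p` near 1 would be the outcome described above, where the mapping does not hold.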

Original abstract

Deep learning, particularly with the advancement of Large Language Models, has transformed biomolecular modeling, with protein language models such as ESM inspiring emerging RNA language models such as RiNALMo. Recent work has begun applying sparse autoencoders (SAEs) to protein language model representations, exploring representation-level interpretability in biomolecular models. Here, we explore whether SAEs can provide interpretable feature decompositions of RNA language model representations, while also examining their limitations in this setting. We present SAE-RNA, an interpretability model that analyzes RiNALMo representations and maps them to known human-level biological features. Rather than claiming definitive biological concept discovery, our study frames SAE-based analysis as a representation-level probe for characterizing how RNA language models organize biological information internally. More broadly, SAE-RNA provides a feature-level framework for comparing RNA groups and identifying sparse representation components associated with RNA family identity or structural context.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SAE-RNA, a sparse autoencoder applied to representations from the RiNALMo RNA language model. It decomposes these representations into sparse features and maps them to known biological annotations such as RNA family identity and structural context. The work is explicitly framed as an exploratory representation-level probe for characterizing internal organization of biological information in RNA LMs, without claiming definitive concept discovery, and offers a feature-level framework for comparing RNA groups.

Significance. If the recovered sparse features can be shown to align with biological concepts beyond what arises from RiNALMo pretraining artifacts, SAE inductive bias, or post-hoc label correlations, the approach would usefully extend SAE-based interpretability methods from protein LMs to the RNA domain and supply a concrete tool for feature-level RNA group comparisons. The current exploratory framing and absence of supporting quantitative evidence keep the immediate significance modest.

major comments (2)

[Abstract] The central mapping claim, that SAE features serve as a probe revealing how RiNALMo organizes biological information, lacks any reported quantitative metrics, error analysis, ablation studies, or baseline comparisons to establish that alignments exceed those expected from training artifacts or annotation correlations.
[Abstract] The manuscript provides no description of how feature–annotation alignments are quantified (e.g., activation correlations, precision-recall against family labels, or statistical tests against shuffled controls), leaving the weakest assumption (that recovered directions reflect meaningful biology rather than spurious correlations) untested.
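
As one illustration of the kind of quantification the referee asks for, the sketch below scores a single feature against family labels by precision and recall at a fixed activation threshold. The threshold, the toy data, and the function name `precision_recall` are hypothetical and are not taken from the manuscript.

```python
def precision_recall(activations, labels, threshold):
    """Treat 'activation > threshold' as a prediction of family membership."""
    predicted = [a > threshold for a in activations]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0   # 0.0 when nothing is predicted
    recall = tp / (tp + fn) if tp + fn else 0.0      # 0.0 when the family is empty
    return precision, recall

# Toy example: one feature's activations and whether each RNA is in the family.
acts = [0.9, 0.7, 0.1, 0.6, 0.0, 0.2]
is_family = [True, True, True, False, False, False]
precision, recall = precision_recall(acts, is_family, threshold=0.5)
```

Sweeping `threshold` over the observed activations would trace out a precision-recall curve for the feature.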

minor comments (1)

[Methods] Clarify the precise SAE training objective and hyperparameter choices (dictionary size, sparsity coefficient, learning rate) in the methods section to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We agree that additional quantitative support would strengthen the presentation of our exploratory probe and will revise the manuscript accordingly. Our point-by-point responses to the major comments follow.

Referee: [Abstract] The central mapping claim, that SAE features serve as a probe revealing how RiNALMo organizes biological information, lacks any reported quantitative metrics, error analysis, ablation studies, or baseline comparisons to establish that alignments exceed those expected from training artifacts or annotation correlations.

Authors: We acknowledge this observation. The manuscript is explicitly positioned as an exploratory representation-level probe rather than a claim of definitive concept discovery. To address the concern, we will revise the abstract and add a dedicated subsection in Results that reports quantitative metrics, including Pearson correlations of feature activations with family and structural annotations, precision-recall against held-out labels, and direct comparisons to shuffled controls and random baselines. We will also include a brief ablation on SAE sparsity and reconstruction error to help separate biological signal from artifacts.
Revision: yes

Referee: [Abstract] The manuscript provides no description of how feature–annotation alignments are quantified (e.g., activation correlations, precision-recall against family labels, or statistical tests against shuffled controls), leaving the weakest assumption (that recovered directions reflect meaningful biology rather than spurious correlations) untested.

Authors: We agree that an explicit description of the quantification procedure is required. Although the current text describes mapping via activation patterns, we will expand the Methods section to detail the exact statistical procedures: activation–annotation correlations, precision-recall curves against RNA family labels, and permutation tests against shuffled annotation controls. These additions will be summarized in the revised abstract and will allow readers to evaluate whether recovered directions exceed spurious correlations.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SAE application to RiNALMo representations is self-contained

The manuscript trains a standard sparse autoencoder on RiNALMo hidden states and then correlates the resulting sparse features with external biological annotations (RNA family labels, structural context). No derivation chain exists in which a claimed prediction or first-principles result is shown to be identical to its inputs by construction. The work explicitly frames itself as a representation-level probe rather than a discovery claim, invokes no uniqueness theorems, and does not smuggle ansatzes or rename known results via self-citation. All reported mappings are post-hoc empirical observations whose validity can be tested against held-out annotations or alternative models, satisfying the criteria for non-circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that sparse autoencoders will recover biologically meaningful features from RNA language model representations; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Sparse autoencoders applied to language model representations can yield features that align with human-interpretable biological concepts.
Invoked when the study frames SAE analysis as a probe that maps to known biological features.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We train a traditional overcomplete SAEs to decompose dense embeddings x ∈ R^d into sparse features f ∈ R^k ... L = ∥x − x̂∥₂² + λ∥f∥₁"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "deeper layers transition from diffuse to type-selective activations ... small subset of channels preferentially firing for specific ncRNA families"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.