Atlas-Alignment: Making Interpretability Transferable Across Language Models

Bruno Puri; Jim Berend; Sebastian Lapuschkin; Wojciech Samek

arXiv: 2510.27413 · v2 · submitted 2025-10-31 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-18 02:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords interpretability · language models · latent space alignment · concept atlas · semantic retrieval · steerable generation · mechanistic interpretability · representational alignment

The pith

Atlas-Alignment aligns a new language model's latent space to a pre-labeled Concept Atlas using shared inputs alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Atlas-Alignment to transfer interpretability across language models without repeating expensive labeling or training steps for each new model. It aligns the latent representations of any new model to those of an existing labeled Concept Atlas by applying lightweight methods to shared input examples. This transfer lets the new model inherit concept labels for tasks like semantic retrieval and activation steering. The approach rests on the idea that a single high-quality atlas can serve many models at low additional cost. In this way the work aims to reduce the per-model overhead that currently slows scalable interpretability.

Core claim

Atlas-Alignment is a framework that aligns the latent space of a new model to a pre-existing, labeled Concept Atlas using only shared inputs and lightweight representational alignment methods. Through quantitative and qualitative evaluations, the paper shows that simple alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept datasets for the new model.

What carries the argument

Atlas-Alignment, the procedure that maps a new model's latent space onto a fixed labeled Concept Atlas via lightweight representational alignment on shared inputs.
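One lightweight alignment of this kind is an orthogonal Procrustes map, which the paper passage quoted further down this page names. The following is a minimal sketch on synthetic data, not the authors' implementation. It assumes both models expose activations of the same width for the same inputs; the function names, the shapes, and the synthetic "new model" (a rotated copy of the atlas activations) are illustrative assumptions.

```python
import numpy as np

def l2_normalize_rows(a, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    return a / np.maximum(norms, eps)

def fit_orthogonal_map(new_acts, atlas_acts):
    """Return the orthogonal W minimizing ||new_acts @ W - atlas_acts||_F."""
    # Classic Procrustes solution: SVD of the cross-covariance, then W = U @ Vt.
    u, _, vt = np.linalg.svd(new_acts.T @ atlas_acts)
    return u @ vt

# Synthetic stand-ins (assumption): 200 shared inputs, 16-dimensional activations.
rng = np.random.default_rng(0)
atlas_acts = l2_normalize_rows(rng.normal(size=(200, 16)))
true_rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
new_acts = atlas_acts @ true_rotation.T  # "new model" = rotated atlas space

W = fit_orthogonal_map(new_acts, atlas_acts)
aligned = new_acts @ W  # new-model activations expressed in atlas coordinates
```

In this idealized setting the fitted map recovers the rotation exactly. Real model pairs will not be related by a pure rotation, which is where the load-bearing premise below comes in.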

If this is right

Interpretability costs amortize: one Concept Atlas can serve many models.
New models gain semantic retrieval and concept-based steering without fresh labeled data.
The transparency tax shrinks because alignment replaces model-specific labeling and validation.
Mechanistic interpretability keeps pace with rapid model development.
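To make the steering consequence concrete, here is a hedged sketch of how a labeled atlas direction might be pulled back into a new model's activation space once an orthogonal map is available. The map `W`, the concept vector, the hidden state, and the strength `alpha` are all hypothetical stand-ins; the paper's actual steering procedure is not described on this page.

```python
import numpy as np

def steer(hidden, atlas_direction, W, alpha=4.0):
    """Add an atlas concept direction, pulled back into the new model's space."""
    # For an orthogonal W (new -> atlas), the inverse map is W.T.
    direction = atlas_direction @ W.T
    direction = direction / np.linalg.norm(direction)
    return hidden + alpha * direction

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
W, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # stand-in for a fitted orthogonal map
concept = rng.normal(size=8)                  # hypothetical atlas concept vector
hidden = rng.normal(size=8)                   # hypothetical new-model activation

steered = steer(hidden, concept, W)
before = cosine(hidden @ W, concept)   # similarity to the concept, seen in atlas space
after = cosine(steered @ W, concept)   # same measurement after steering
```

Viewed in atlas coordinates, the steered activation lies closer to the concept direction than the original one. Whether that shift changes generated text in the intended way is an empirical question this sketch does not address.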

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alignment idea could be tested on models trained on very different data to check how far shared semantic structure extends.
Combining Atlas-Alignment with existing interpretability tools might increase robustness on edge cases.
If alignment works across scales, it could support quick interpretability checks during model iteration cycles.

Load-bearing premise

The latent spaces of different language models share enough semantic structure that simple alignment on shared inputs transfers concept labels without substantial distortion.

What would settle it

A clear mismatch between the transferred labels and the actual concept activations in the new model, measured by low retrieval accuracy or ineffective steering on held-out inputs, would show the alignment failed to preserve meaning.
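A check of this kind can be sketched as top-k retrieval accuracy on held-out inputs: for each aligned activation, ask whether its own atlas counterpart is among the k nearest atlas vectors by cosine similarity. This is a generic metric chosen for illustration, not necessarily the one the paper reports, and the data below is synthetic.

```python
import numpy as np

def top_k_retrieval_accuracy(aligned, atlas, k=1):
    """Fraction of rows whose matching atlas row is among the k most similar."""
    a = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    b = atlas / np.linalg.norm(atlas, axis=1, keepdims=True)
    sims = a @ b.T                              # (n, n) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best matches
    targets = np.arange(len(aligned))[:, None]  # row i should retrieve atlas row i
    return float((top_k == targets).any(axis=1).mean())

# Synthetic held-out set (assumption): 50 inputs, 16-dimensional atlas vectors.
rng = np.random.default_rng(2)
atlas_heldout = rng.normal(size=(50, 16))
good = atlas_heldout + 0.01 * rng.normal(size=(50, 16))  # near-perfect alignment
bad = rng.normal(size=(50, 16))                          # unrelated vectors

acc_good = top_k_retrieval_accuracy(good, atlas_heldout, k=1)
acc_bad = top_k_retrieval_accuracy(bad, atlas_heldout, k=1)
```

A failed alignment would look like the `bad` case, with accuracy near chance (about 1/n for k=1), while a faithful one approaches the `good` case.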

Original abstract

Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific components (e.g., sparse autoencoders), followed by manual or semi-automated labeling and validation, imposing a growing "transparency tax" that does not scale with the pace of model development. We introduce Atlas-Alignment, a framework that avoids this cost by aligning the latent space of a new model to a pre-existing, labeled Concept Atlas using only shared inputs and lightweight representational alignment methods. Through quantitative and qualitative evaluations, we show that simple alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept datasets. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in a single high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Atlas-Alignment sketches a transfer framework that reuses one labeled Concept Atlas across models via lightweight alignment, but the abstract gives almost no evidence that the mapping actually preserves concept identity.

The paper's main move is to treat interpretability as something you can amortize: build one solid labeled atlas once, then align new models to it with simple methods on shared inputs so you skip per-model labeling and SAE training. That framing of the transparency tax is direct and practical, and the shift away from fully model-specific pipelines is a genuine change in setup compared to most current work.

Referee Report

2 major / 2 minor

Summary. The paper introduces Atlas-Alignment, a framework that aligns the latent space of a new language model to a pre-existing labeled Concept Atlas using only shared inputs and lightweight representational alignment methods. The central claim is that this enables robust semantic retrieval and steerable generation in the target model without requiring model-specific labeled concept datasets, as supported by quantitative and qualitative evaluations.

Significance. If the central claim holds, the work would have substantial significance for mechanistic interpretability by amortizing the cost of high-quality concept labeling across many models, thereby reducing the transparency tax on new model development. The emphasis on lightweight, label-free transfer is practically appealing if the semantic correspondences survive the alignment step.

major comments (2)

[Abstract] The claim that quantitative and qualitative evaluations support robust retrieval and steerable generation is stated without any description of the alignment procedure (e.g., linear, CCA, or low-rank), the retrieval or generation metrics, the baselines, or controls for confounds such as input distribution shift. This absence prevents assessment of whether the reported robustness is load-bearing or merely anecdotal.
[§3 (Method) and §4 (Experiments)] The central assumption that latent spaces of different models share sufficient semantic geometry for lightweight alignment on shared inputs to transfer concept labels without substantial distortion is not tested. No ablation or counter-example is provided for architectures or training regimes known to produce non-isomorphic encodings; such a test is required because even linear alignments can produce mis-mapped labels when the underlying geometry differs, directly undermining the retrieval and generation results.

minor comments (2)

[Introduction] The term 'Concept Atlas' is used throughout without an explicit formal definition or reference to its construction; a short paragraph or equation defining its structure would improve clarity.
[Figures] Figure captions and axis labels in the qualitative examples should explicitly state the alignment method and the source/target models to allow readers to interpret the visualized correspondences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.

Referee: [Abstract] The claim that quantitative and qualitative evaluations support robust retrieval and steerable generation is stated without any description of the alignment procedure (e.g., linear, CCA, or low-rank), the retrieval or generation metrics, the baselines, or controls for confounds such as input distribution shift. This absence prevents assessment of whether the reported robustness is load-bearing or merely anecdotal.

Authors: We agree that the abstract, being a high-level summary, omits specifics that would aid immediate assessment. In the revised version we will expand the abstract to briefly note the use of linear and low-rank alignment procedures, the primary metrics (retrieval precision@K and steering success rate), the inclusion of matched-input baselines, and explicit controls for input distribution shift via shared prompt sets. These additions will remain concise while directing readers to the full methodological and experimental details in Sections 3 and 4.

revision: yes

Referee: [§3 (Method) and §4 (Experiments)] The central assumption that latent spaces of different models share sufficient semantic geometry for lightweight alignment on shared inputs to transfer concept labels without substantial distortion is not tested. No ablation or counter-example is provided for architectures or training regimes known to produce non-isomorphic encodings; such a test is required because even linear alignments can produce mis-mapped labels when the underlying geometry differs, directly undermining the retrieval and generation results.

Authors: We acknowledge that a direct test of the shared-geometry assumption across divergent architectures would strengthen the claims. Our current experiments demonstrate successful label transfer on models with comparable training regimes and architectures, supported by quantitative alignment fidelity and downstream retrieval/generation metrics. To address the referee's point, we will add an ablation subsection that includes model pairs with known encoding differences (e.g., contrasting decoder-only vs. encoder-decoder or differently regularized variants) and report cases where alignment quality and label fidelity degrade, thereby clarifying the boundary conditions of Atlas-Alignment.

revision: yes

Circularity Check

0 steps flagged

No circularity: Atlas-Alignment uses external alignment on shared inputs with independent empirical validation

The paper's central method aligns a new model's latent space to a pre-existing labeled Concept Atlas via lightweight representational alignment on shared inputs alone. The claimed outcomes—robust semantic retrieval and steerable generation—are evaluated quantitatively and qualitatively on downstream tasks that test label transfer fidelity, rather than being defined into existence by the alignment procedure. No equations reduce a result to its own fitted parameters by construction, no load-bearing self-citations justify uniqueness, and no ansatz is smuggled in; the derivation chain remains self-contained against external benchmarks of alignment success and task performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that latent spaces are alignable across models using only shared inputs. The Concept Atlas is introduced as a new entity whose labels become transferable once alignment succeeds. No explicit free parameters are mentioned.

axioms (1)

domain assumption: Latent spaces of different language models share sufficient structure that lightweight alignment on shared inputs can transfer semantic labels effectively.
This premise is required for the transfer of interpretability without new labeled data.

invented entities (1)

Concept Atlas (no independent evidence)
purpose: A pre-existing labeled reference of concepts in latent space that serves as the source for transferred interpretability.
Central new artifact whose creation and reuse enables cost amortization.

pith-pipeline@v0.9.0 · 5695 in / 1267 out tokens · 55932 ms · 2026-05-18T02:50:45.955351+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas using only shared inputs and lightweight representational alignment techniques."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Orthogonal Procrustes Translation: Constrains the translation matrix to be orthogonal... After a row-wise L2-normalization of the activations, this is equivalent to minimizing the cosine distance between activations."
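
The equivalence stated in the quoted Orthogonal Procrustes passage can be checked in a few lines. The notation here is an assumption for illustration, not the paper's: x is a new-model activation and y the matching atlas activation, both row vectors with unit L2 norm, and W is an orthogonal translation matrix.

```latex
\begin{aligned}
\lVert xW - y \rVert_2^2
  &= \lVert xW \rVert_2^2 + \lVert y \rVert_2^2 - 2\,\langle xW,\, y \rangle \\
  &= 1 + 1 - 2\cos(xW,\, y)
     \qquad \text{since } \lVert xW \rVert_2 = \lVert x \rVert_2 = 1 \text{ for orthogonal } W \\
  &= 2\,\bigl(1 - \cos(xW,\, y)\bigr).
\end{aligned}
```

Summed over the shared inputs, the squared-error objective is therefore twice the total cosine distance, so both are minimized by the same orthogonal W.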

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.