
arxiv: 2605.28896 · v1 · pith:XBNJLRIHnew · submitted 2026-05-27 · 💻 cs.LG

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

Pith reviewed 2026-06-29 13:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords LoRA · Sparse Autoencoders · Fine-tuning · Representational geometry · Delta activations · Language models · Mechanistic interpretability

The pith

LoRA fine-tuning produces feature structures in language models that show weak geometric alignment with pretrained sparse autoencoder dictionaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the contribution of LoRA adapters to the residual stream by subtracting base-model activations from fine-tuned ones. It then trains new sparse autoencoders on those delta activations across multiple layers and ranks in Gemma-2-9B. These adapter-specific SAEs align only weakly with pretrained SAE features when measured by cosine similarity of decoder directions, principal angles between subspaces, and centered kernel alignment. The adapter SAEs also reconstruct the delta activations more accurately than the pretrained ones do. The results indicate that fine-tuning moves representations into regions of activation space that existing base-model dictionaries do not fully cover.
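
As a minimal sketch of the subtraction step described above, the delta for each token is the fine-tuned residual vector minus the base-model residual vector at the same layer and position. The function name `delta_activations` and the toy values are illustrative assumptions, not the paper's code; real activations would come from running both models on identical inputs.

```python
def delta_activations(finetuned, base):
    """Elementwise difference (fine-tuned minus base) for each token's residual vector."""
    if len(finetuned) != len(base):
        raise ValueError("both models must be run on the same tokens")
    return [[f - b for f, b in zip(ft_row, base_row)]
            for ft_row, base_row in zip(finetuned, base)]

# Toy residual-stream activations (hypothetical): 2 tokens, 3 dimensions.
base_acts = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
finetuned_acts = [[0.75, -1.0, 1.0], [1.5, 0.5, -0.5]]

# These deltas, not the raw activations, are what the adapter-specific SAEs are trained on.
deltas = delta_activations(finetuned_acts, base_acts)
```

The sketch uses plain lists to stay dependency-free; an array library would be the natural choice at real model scale.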

Core claim

Across layers and ranks, LoRA-induced feature dictionaries exhibit comparatively weak geometric alignment with pretrained SAE features, while adapter-specific SAEs reconstruct delta activations more effectively, indicating that LoRA updates occupy partially distinct representational structure within the residual stream. Feature density increases with rank and depth, yet geometric divergence remains relatively stable across ranks.
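
The reconstruction comparison in this claim can be illustrated with a small sketch. It assumes a normalized mean squared error and a relative improvement in percent; the review text does not state which error metric the paper uses, so both formulas, the function names, and the toy reconstructions are assumptions.

```python
def normalized_mse(targets, reconstructions):
    """Squared reconstruction error divided by the squared norm of the targets."""
    error = sum((t - r) ** 2
                for t_row, r_row in zip(targets, reconstructions)
                for t, r in zip(t_row, r_row))
    energy = sum(t * t for t_row in targets for t in t_row)
    return error / energy

def improvement_percent(base_error, adapter_error):
    """Relative error reduction of the adapter SAE over the pretrained SAE, in percent."""
    return 100.0 * (base_error - adapter_error) / base_error

# Toy delta activations and two hypothetical reconstructions of them.
deltas = [[1.0, 2.0], [0.0, -1.0]]
pretrained_recon = [[0.0, 0.0], [0.0, 0.0]]  # pretend the pretrained SAE outputs zeros
adapter_recon = [[1.0, 1.0], [0.0, -1.0]]    # pretend the adapter SAE misses one coordinate

base_error = normalized_mse(deltas, pretrained_recon)
adapter_error = normalized_mse(deltas, adapter_recon)
gain = improvement_percent(base_error, adapter_error)
```

A positive `gain` means the adapter-specific SAE reconstructs the deltas with lower error than the pretrained one.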

What carries the argument

A delta activation framework that isolates the adapter-specific contribution to the residual stream by subtracting base-model activations from fine-tuned activations before SAE training.

If this is right

  • Pretrained SAE dictionaries may miss representational changes introduced by LoRA fine-tuning.
  • Adapter-specific SAEs are needed to interpret the internal effects of parameter-efficient adaptation.
  • Feature density rises with increasing LoRA rank and layer depth while geometric divergence stays stable.
  • Mechanistic interpretability and safety auditing of fine-tuned models must account for these partially distinct structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing base-model interpretability pipelines could systematically under-detect behavioral shifts that arise only after adaptation.
  • Safety evaluations performed solely on pretrained dictionaries might overlook new features created during downstream fine-tuning.
  • A practical extension would be to maintain a small library of delta-trained SAEs for common adaptation methods rather than relying on a single pretrained dictionary.

Load-bearing premise

The delta activation framework cleanly isolates adapter-specific contributions without residual confounding from the base model weights or training dynamics.

What would settle it

A replication that measures cosine similarity, principal angles, and CKA between adapter-specific and pretrained SAE decoder directions on the same Gemma-2-9B LoRA models and finds strong rather than weak alignment.
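
One of the alignment checks named above, cosine similarity between decoder directions, could be sketched as follows. Taking the best match per adapter feature and then averaging is an assumed aggregation, since the review text does not say how the paper summarizes pairwise similarities; the dictionaries are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_cosine_per_feature(adapter_decoder, base_decoder):
    """For each adapter feature direction, the best cosine match in the base dictionary."""
    return [max(cosine(d, b) for b in base_decoder) for d in adapter_decoder]

# Toy dictionaries (hypothetical values): each row is one decoder direction.
base_decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adapter_decoder = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

scores = max_cosine_per_feature(adapter_decoder, base_decoder)
mean_alignment = sum(scores) / len(scores)
```

Scores near 1 would indicate that an adapter feature is already present in the pretrained dictionary; scores near 0 would indicate a direction the pretrained dictionary does not cover.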

Figures

Figures reproduced from arXiv: 2605.28896 by Prasanth K K.

Figure 1: Delta norm heatmap across layers and ranks.
Figure 2: Reconstruction error comparison. (Body text extracted with this caption: "… and delta magnitude is non-monotonic: r = 8 produces the largest norm at layer 38 (345.45), exceeding r = 32 (330.81). The delta exhibits non-zero variance across all residual dimensions, indicating broad distribution across the residual stream rather than confined to a low-dimensional subspace.")
Figure 3: Reconstruction improvement (%) of delta SAEs over Gemma Scope on held-out data
Figure 4: Feature density (mean active features per token) across layers and ranks. Left: density
Figure 5: Feature Overlap
Figure 6: Cosine Similarity
Figure 7: Summary of key findings across all ranks and layers. (A) Delta norm heatmap showing

Original abstract

Low-Rank Adaptation (LoRA) has emerged as a widely adopted approach for adapting large language models, yet the internal representational changes induced by LoRA fine-tuning remain insufficiently understood. In this work, we investigate the geometry of LoRA-induced representations using Sparse Autoencoders (SAEs). We introduce a delta activation framework that isolates the adapter-specific contribution to the residual stream. Using Gemma-2-9B with LoRA ranks 4, 8, 16, and 32, we train adapter-specific SAEs across multiple transformer layers and compare their learned feature spaces with pretrained SAE dictionaries. We evaluate representational alignment using cosine similarity between decoder directions, principal-angle analysis of feature subspaces, and Centered Kernel Alignment (CKA) between activation representations. Across layers and ranks, we consistently observe comparatively weak geometric alignment between LoRA-induced feature dictionaries and pretrained SAE features. Adapter-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, suggesting that LoRA updates occupy partially distinct representational structure within the residual stream. Additionally, feature density increases with rank and depth, while geometric divergence remains relatively stable across ranks. These findings provide empirical evidence that LoRA fine-tuning can induce feature structures that are not fully captured by pretrained interpretability dictionaries, with implications for mechanistic interpretability, adaptation analysis, and safety auditing of fine-tuned language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LoRA fine-tuning on Gemma-2-9B induces partially distinct representational structure in the residual stream, as shown by weak geometric alignment (via cosine similarity of decoder directions, principal-angle analysis, and CKA) between adapter-specific SAE features and pretrained SAE dictionaries, plus superior reconstruction of delta activations (finetuned minus base) by adapter-specific SAEs. Feature density increases with rank and depth while geometric divergence remains stable across ranks 4-32 and multiple layers.

Significance. If the central empirical findings hold after addressing controls, the work would demonstrate that pretrained interpretability dictionaries are insufficient for analyzing fine-tuned models and motivate adapter-specific SAEs, with direct relevance to mechanistic interpretability, adaptation analysis, and safety auditing. The use of multiple alignment metrics on real LoRA ranks provides a concrete starting point, though the absence of reported quantitative values limits immediate assessment of effect sizes.

major comments (2)
  1. [Abstract, delta activation framework paragraph] The claim that delta activations cleanly isolate adapter-specific contributions is load-bearing for the central claim of distinct structure, yet the manuscript supplies no controls (e.g., random perturbations or base-model training noise) for non-linear interactions or input-distribution shifts induced by LoRA in subsequent layers; without such tests it remains possible that observed divergence is an artifact of the subtraction operator on shifted activation manifolds.
  2. [Abstract] The statement of 'consistent observations across ranks and layers' is presented without any quantitative values, error bars, layer counts, statistical tests, or SAE hyperparameter controls, so the support for the central claim cannot be evaluated from the provided text.
minor comments (2)
  1. Clarify the exact definition and layer-wise application of the delta activation (finetuned minus base) to avoid ambiguity in how it interacts with residual-stream dependencies.
  2. Add error bars, sample sizes, and p-values to all reported alignment and reconstruction metrics so that claims of 'weak alignment' and 'more effective' reconstruction can be assessed for robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important considerations for the delta activation framework and the presentation of results in the abstract. We address each major comment below and outline planned revisions.

Point-by-point responses
  1. Referee: [Abstract, delta activation framework paragraph] The claim that delta activations cleanly isolate adapter-specific contributions is load-bearing for the central claim of distinct structure, yet the manuscript supplies no controls (e.g., random perturbations or base-model training noise) for non-linear interactions or input-distribution shifts induced by LoRA in subsequent layers; without such tests it remains possible that observed divergence is an artifact of the subtraction operator on shifted activation manifolds.

    Authors: We agree this is a substantive limitation: the delta activation approach assumes approximate additivity, and non-linear interactions or distribution shifts could contribute to observed differences. While this subtraction is a standard technique in fine-tuning analysis, the concern is valid and the manuscript should not overstate isolation. In revision we will add an explicit limitations paragraph discussing the assumption and include a control analysis applying random perturbations to base-model activations to quantify how much divergence arises from the operator itself versus adapter-induced structure.
    revision: yes

  2. Referee: [Abstract] The statement of 'consistent observations across ranks and layers' is presented without any quantitative values, error bars, layer counts, statistical tests, or SAE hyperparameter controls, so the support for the central claim cannot be evaluated from the provided text.

    Authors: The abstract is a concise summary, but we accept that it should convey more concrete support for the consistency claim. The full manuscript already reports results across ranks 4–32 and multiple layers using three alignment metrics, with tables of mean cosine similarities, principal angles, and CKA values plus standard deviations. In the revised version we will augment the abstract with representative quantitative figures (e.g., average cosine similarity ranges and layer counts) and note that hyperparameter robustness was checked via multiple SAE widths.
    revision: yes
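
The random-perturbation control promised in the first response is not specified further, so the following is only one possible form, stated as an assumption: replace each real delta with a random Gaussian direction of the same L2 norm, then run the identical SAE and alignment pipeline on these control deltas. The function names are hypothetical.

```python
import math
import random

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def norm_matched_random_deltas(deltas, seed=0):
    """Replace each delta vector with a random Gaussian direction of the same L2 norm."""
    rng = random.Random(seed)  # seeded so the control set is reproducible
    controls = []
    for d in deltas:
        g = [rng.gauss(0.0, 1.0) for _ in d]
        scale = l2_norm(d) / l2_norm(g)
        controls.append([scale * a for a in g])
    return controls

# Toy deltas (hypothetical values): 2 tokens, 3 dimensions.
deltas = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
controls = norm_matched_random_deltas(deltas)
```

If SAEs trained on such control deltas diverged from the pretrained dictionary as much as the adapter SAEs do, the divergence would be attributable to the subtraction operator rather than to adapter-induced structure.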

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons of independently trained SAEs

Full rationale

The paper reports direct empirical measurements—cosine similarities between decoder directions, principal angles of subspaces, and CKA between activation representations—computed on SAEs trained separately on delta activations versus pretrained dictionaries. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the delta activation subtraction is a fixed preprocessing step whose outputs are then compared with standard, externally defined metrics. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings reduce the central claims to the inputs by construction. The analysis is therefore self-contained against external benchmarks.
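
To make concrete what an externally defined metric looks like here, the sketch below computes the linear variant of CKA between two activation matrices recorded on the same tokens. Whether the paper uses the linear or a kernel variant is not stated in the text reviewed, so the choice of linear CKA is an assumption, as are the function names and toy data.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean over the tokens."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of x^T y for row-major matrices sharing the same rows."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    return cross_gram_sq_norm(xc, yc) / math.sqrt(
        cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))

# Toy representations (hypothetical): rows are tokens, columns are features.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
same = linear_cka(x, x)

a = [[1.0], [-1.0], [0.0], [0.0]]
b = [[0.0], [0.0], [1.0], [-1.0]]
unrelated = linear_cka(a, b)
```

The metric depends only on the two activation matrices, which is consistent with the point that nothing in the comparison is defined in terms of the quantity being tested.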

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or non-standard axioms are stated in the provided text.

axioms (1)
  • domain assumption: Sparse autoencoders recover interpretable features from transformer residual stream activations.
    Standard premise of the SAE interpretability literature invoked by the choice of method.

pith-pipeline@v0.9.1-grok · 5779 in / 1180 out tokens · 52701 ms · 2026-06-29T13:46:26.009671+00:00 · methodology

