pith. sign in

arxiv: 2605.15961 · v1 · pith:72ACJEDMnew · submitted 2026-05-15 · 💻 cs.CV

Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords sparse autoencodersrobust fine-tuningCLIP modelsdistribution shiftsinterpretable representationscatastrophic forgettingvision-language modelsImageNet benchmarks
0
0 comments X

The pith

Sparse autoencoders identify semantic visual features in CLIP to regularize fine-tuning and preserve robustness to distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAE-FT as a way to fine-tune CLIP models without losing their original robustness. A sparse autoencoder is first trained on the pre-trained visual encoder to locate sparse, semantically meaningful feature directions. During fine-tuning the method adds a penalty that discourages the model from adding or removing those directions in its representations. This constraint limits catastrophic forgetting and removes the need for text-based guidance used in earlier approaches. A reader would care because the technique stays computationally light yet reaches or beats prior results on ImageNet and its shift benchmarks while also making the changes to the model interpretable.

Core claim

SAE-FT regularizes changes to visual representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks.

What carries the argument

Sparse Autoencoder trained on the pre-trained CLIP visual representations to extract interpretable feature directions; these directions supply the regularization term that limits which features may be added or removed during fine-tuning.

If this is right

  • Fine-tuning no longer requires expensive text guidance to retain zero-shot robustness.
  • The regularization produces an explicit record of which semantic features were altered.
  • Performance on ImageNet and shift benchmarks reaches or exceeds prior regularization methods.
  • The visual-only formulation simplifies application to new vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization idea could be tested on continual-learning sequences where multiple tasks arrive over time.
  • Because the method tracks feature changes, it could support controlled editing that removes specific unwanted capabilities.
  • If the SAE directions turn out to be stable across model scales, the approach might transfer to larger vision-language systems with little extra cost.

Load-bearing premise

The features found by the sparse autoencoder trained on the pre-trained model are the semantically meaningful directions whose preservation is sufficient to keep robustness against distribution shifts.

What would settle it

Run the SAE-FT procedure on a CLIP model and then measure whether accuracy on distribution-shift test sets falls below the level achieved by unconstrained fine-tuning; if the penalized version loses robustness while the unpenalized version retains it, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.15961 by Ankit Sonthalia, Arnas Uselis, Fabian Morelli, Seong Joon Oh.

Figure 1
Figure 1. Figure 1: Intuition behind SAE-FT. A Sparse Autoencoder trained on the zero-shot model decomposes [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Standard fine-tuning causes the original dictionary to collapse (Fraction of Variance [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Schematic overview of SAE-FT. Changes compared to the zero-shot model are encouraged [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of the feature preserva￾tion regularization. New features are penalized (dark orange), while changing the magnitude of existing features is not penalized. This penalty assumes that the representation basis vectors (the neurons) are the fundamental units of meaning (axis alignment). However, in dense models like CLIP, features are often polysemantic and stored in superposition, meaning that in… view at source ↗
Figure 6
Figure 6. Figure 6: Feature re-weighting in SAE-FT. We analyze an image of a pirate ship misclassified by [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of the feature-task alignment metric. SAE-FT re-weights feature activations [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
read the original abstract

Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SAE-FT, a method for robust fine-tuning of CLIP vision-language models. A sparse autoencoder is first trained on activations from the pre-trained visual encoder to discover semantically meaningful features. During downstream fine-tuning, a regularization term penalizes the addition or removal of these specific features in the visual representations. The approach is presented as computationally efficient, interpretable, and free of text guidance; it is claimed to match or exceed prior state-of-the-art results on ImageNet and its distribution-shift benchmarks while mitigating catastrophic forgetting.

Significance. If the empirical claims are substantiated, the work supplies a mechanistically transparent regularization strategy that operates solely on visual features and yields both robustness and interpretability gains. The public code release supports reproducibility. The method could serve as a practical alternative to text-guided approaches and as a tool for analyzing which directions in CLIP’s representation space are critical for out-of-distribution performance.

major comments (2)
  1. [§3 (Method), Eq. (3)–(5)] §3 (Method), Eq. (3)–(5): the central claim attributes robustness preservation to the specific semantic features discovered by the pre-trained SAE. However, no ablation is reported that replaces the SAE-derived mask with either random directions or a generic L1/sparsity penalty on the same activations. Without this control experiment it is impossible to determine whether the SAE training step is load-bearing or whether any sufficiently strong activation regularizer would produce the reported ImageNet and shift results.
  2. [§4 (Experiments)] §4 (Experiments): the abstract and results assert matching or exceeding SOTA performance on ImageNet and distribution-shift suites, yet the manuscript supplies neither error bars across multiple random seeds nor statistical significance tests for the reported gains. Given the known sensitivity of CLIP fine-tuning to hyperparameters and initialization, these omissions weaken confidence that the observed improvements are reliable and attributable to the proposed regularization.
minor comments (2)
  1. [§2–3] Notation for the SAE reconstruction loss and the fine-tuning regularizer should be unified across sections to avoid confusion between the pre-training and fine-tuning stages.
  2. [Figure 2] Figure 2 (feature visualization) would benefit from a side-by-side comparison with a non-SAE baseline to illustrate the claimed semantic specificity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation that we address below. We have prepared revisions to strengthen the paper accordingly.

read point-by-point responses
  1. Referee: [§3 (Method), Eq. (3)–(5)] §3 (Method), Eq. (3)–(5): the central claim attributes robustness preservation to the specific semantic features discovered by the pre-trained SAE. However, no ablation is reported that replaces the SAE-derived mask with either random directions or a generic L1/sparsity penalty on the same activations. Without this control experiment it is impossible to determine whether the SAE training step is load-bearing or whether any sufficiently strong activation regularizer would produce the reported ImageNet and shift results.

    Authors: We agree that a direct ablation replacing the SAE-derived mask with random directions or a generic L1 penalty on the same activations would more conclusively demonstrate that the semantic features identified by the pre-trained SAE are load-bearing for the observed robustness. The current manuscript emphasizes the interpretability benefits and shows that the discovered features align with semantic concepts via qualitative inspection, but does not include these specific controls. We will add the requested ablation experiments to the revised version, comparing performance when using random masks versus the SAE mask and a standard sparsity penalty, to isolate the contribution of the SAE training step. revision: yes

  2. Referee: [§4 (Experiments)] §4 (Experiments): the abstract and results assert matching or exceeding SOTA performance on ImageNet and distribution-shift suites, yet the manuscript supplies neither error bars across multiple random seeds nor statistical significance tests for the reported gains. Given the known sensitivity of CLIP fine-tuning to hyperparameters and initialization, these omissions weaken confidence that the observed improvements are reliable and attributable to the proposed regularization.

    Authors: We recognize that the absence of error bars and statistical significance testing limits the strength of the empirical claims, particularly given the known variability in CLIP fine-tuning. The reported results reflect single-run evaluations performed under the computational constraints at submission time. In the revised manuscript we will rerun the key experiments across multiple random seeds, report means with standard deviations as error bars, and include statistical significance tests comparing SAE-FT to the baselines to better substantiate the performance improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: SAE trained externally on pre-trained model and used as independent regularizer

full rationale

The method trains a Sparse Autoencoder on the frozen pre-trained CLIP visual representations to identify features, then applies a penalty on changes to those features during fine-tuning. This separation means the regularization term is defined from an independent pre-training stage rather than being fitted to or defined by the fine-tuning outcomes themselves. No equations reduce the robustness gains to a self-referential fit, no uniqueness theorem is imported from self-citations to force the approach, and the central performance claims rest on empirical results on ImageNet and shift benchmarks rather than any derivation that collapses to its inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that sparse autoencoder features capture the relevant semantics for robustness.

pith-pipeline@v0.9.0 · 5697 in / 1111 out tokens · 47971 ms · 2026-05-20T18:28:59.977648+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Concerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayde McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. Towards mo...

  2. [2]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

  3. [3]

    An analysis of single-layer networks in unsuper- vised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsuper- vised feature learning. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR

  4. [4]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

  5. [5]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

  6. [6]

    Toy models of superposition.Transformer Circuits Thread, 2022

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022

  7. [7]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

  8. [8]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093, 2024

  9. [9]

    Finetune like you pretrain: Improved finetuning of zero-shot vision models

    Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023

  10. [10]

    Causal interpretation of sparse autoencoder features in vision.arXiv preprint arXiv:2509.00749, 2025

    Sangyu Han, Yearim Kim, and Nojun Kwak. Causal interpretation of sparse autoencoder features in vision.arXiv preprint arXiv:2509.00749, 2025

  11. [11]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  12. [12]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  13. [13]

    Steering CLIP’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

    Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, and Blake Aaron Richards. Steering CLIP’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

  14. [14]

    StarFT: Robust fine-tuning of zero-shot models via spuriosity alignment.arXiv preprint arXiv:2505.13232, 2025

    Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, and Jinwoo Shin. StarFT: Robust fine-tuning of zero-shot models via spuriosity alignment.arXiv preprint arXiv:2505.13232, 2025. 14

  15. [15]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  16. [16]

    Earnshaw, Imran S

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS...

  17. [17]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InProceedings of the 36th International Conference on Machine Learning (ICML), pages 3519–3529, 2019

  18. [18]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, pages 32–33, 2009

  19. [19]

    Fine-tuning can distort pretrained features and underperform out-of- distribution

    Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine- tuning can distort pretrained features and underperform out-of-distribution.arXiv preprint arXiv:2202.10054, 2022

  20. [20]

    Sparse autoencoders reveal selec- tive remapping of visual concepts during adaptation

    Hyesu Lim, Jinho Choi, Jaegul Choo, and Steffen Schneider. Sparse autoencoders reveal selec- tive remapping of visual concepts during adaptation. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025

  21. [21]

    Context-aware robust fine-tuning.International Journal of Computer Vision, 132(5):1685–1700, 2024

    Xiaofeng Mao, Yufeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, and Zhao Li. Context-aware robust fine-tuning.International Journal of Computer Vision, 132(5):1685–1700, 2024

  22. [22]

    Jishnu Mukhoti, Yarin Gal, Philip H. S. Torr, and Puneet K. Dokania. Fine-tuning can cripple your foundation model; preserving features may be the solution.arXiv preprint arXiv:2308.13320, 2024

  23. [23]

    Sparse autoencoder.CS294A Lecture notes, 72(2011):1–19, 2011

    Andrew Ng et al. Sparse autoencoder.CS294A Lecture notes, 72(2011):1–19, 2011

  24. [24]

    Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

    Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  26. [26]

    Do ImageNet classifiers generalize to ImageNet? InProceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? InProceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

  27. [27]

    Effective robustness against natural distribution shifts for models with different training data.Advances in Neural Information Processing Systems, 36:73543–73558, 2023

    Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, and Yao Qin. Effective robustness against natural distribution shifts for models with different training data.Advances in Neural Information Processing Systems, 36:73543–73558, 2023

  28. [28]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features....

  29. [29]

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global represen- tations by penalizing local predictive power. InAdvances in Neural Information Processing Systems (NeurIPS), pages 10506–10518, 2019. 15

  30. [30]

    Robust fine-tuning of zero-shot models

    Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022

  31. [31]

    Explicit inductive bias for transfer learning with convolutional networks

    LI Xuhong, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. InInternational conference on machine learning, pages 2825–

  32. [32]

    A CKA analysis details The similarity metric CKA, proposed by Kornblith et al

    PMLR, 2018. A CKA analysis details The similarity metric CKA, proposed by Kornblith et al. [17], is a similarity metric that is invariant to orthogonal projections and isotropic scaling, but not to invertible linear functions. AssumingXandYare centered it holds true that: 1 (n−1) 2 tr(XX T Y Y T ) =∥cov(X T , Y T )∥2 F .(12) HSIC generalizes this to inner...