Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Ankit Sonthalia; Arnas Uselis; Fabian Morelli; Seong Joon Oh

arxiv: 2605.15961 · v1 · pith:72ACJEDMnew · submitted 2026-05-15 · 💻 cs.CV

Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Fabian Morelli , Arnas Uselis , Ankit Sonthalia , Seong Joon Oh This is my paper

Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords sparse autoencodersrobust fine-tuningCLIP modelsdistribution shiftsinterpretable representationscatastrophic forgettingvision-language modelsImageNet benchmarks

0 comments

The pith

Sparse autoencoders identify semantic visual features in CLIP to regularize fine-tuning and preserve robustness to distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SAE-FT as a way to fine-tune CLIP models without losing their original robustness. A sparse autoencoder is first trained on the pre-trained visual encoder to locate sparse, semantically meaningful feature directions. During fine-tuning the method adds a penalty that discourages the model from adding or removing those directions in its representations. This constraint limits catastrophic forgetting and removes the need for text-based guidance used in earlier approaches. A reader would care because the technique stays computationally light yet reaches or beats prior results on ImageNet and its shift benchmarks while also making the changes to the model interpretable.

Core claim

SAE-FT regularizes changes to visual representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks.

What carries the argument

Sparse Autoencoder trained on the pre-trained CLIP visual representations to extract interpretable feature directions; these directions supply the regularization term that limits which features may be added or removed during fine-tuning.

If this is right

Fine-tuning no longer requires expensive text guidance to retain zero-shot robustness.
The regularization produces an explicit record of which semantic features were altered.
Performance on ImageNet and shift benchmarks reaches or exceeds prior regularization methods.
The visual-only formulation simplifies application to new vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization idea could be tested on continual-learning sequences where multiple tasks arrive over time.
Because the method tracks feature changes, it could support controlled editing that removes specific unwanted capabilities.
If the SAE directions turn out to be stable across model scales, the approach might transfer to larger vision-language systems with little extra cost.

Load-bearing premise

The features found by the sparse autoencoder trained on the pre-trained model are the semantically meaningful directions whose preservation is sufficient to keep robustness against distribution shifts.

What would settle it

Run the SAE-FT procedure on a CLIP model and then measure whether accuracy on distribution-shift test sets falls below the level achieved by unconstrained fine-tuning; if the penalized version loses robustness while the unpenalized version retains it, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.15961 by Ankit Sonthalia, Arnas Uselis, Fabian Morelli, Seong Joon Oh.

**Figure 2.** Figure 2: Standard fine-tuning causes the original dictionary to collapse (Fraction of Variance [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Schematic overview of SAE-FT. Changes compared to the zero-shot model are encouraged [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: Visualization of the feature preservation regularization. New features are penalized (dark orange), while changing the magnitude of existing features is not penalized. This penalty assumes that the representation basis vectors (the neurons) are the fundamental units of meaning (axis alignment). However, in dense models like CLIP, features are often polysemantic and stored in superposition, meaning that in… view at source ↗

**Figure 6.** Figure 6: Feature re-weighting in SAE-FT. We analyze an image of a pirate ship misclassified by [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of the feature-task alignment metric. SAE-FT re-weights feature activations [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

read the original abstract

Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAE-FT gives a text-free way to regularize CLIP fine-tuning by penalizing changes to features found by a pre-trained sparse autoencoder, which is a clean practical idea, but the results do not yet show that the SAE semantics are doing more than generic sparsity would.

read the letter

The main takeaway is that this work uses a sparse autoencoder trained on the original CLIP visual encoder to add a regularization term during fine-tuning. The penalty discourages adding or dropping the features the SAE has identified, with the goal of keeping robustness on distribution shifts while staying interpretable and avoiding text prompts entirely. That setup is new enough in the fine-tuning literature to be worth attention, and releasing the code helps anyone who wants to reproduce or extend it. The reported numbers on ImageNet and the shift suites look competitive with prior methods, which at least shows the approach is workable in practice. The interpretability claim is also straightforward: you can inspect which SAE features change and tie them to semantic concepts. That part feels honest and useful for downstream analysis. The softer spot is the causal link the paper wants to draw. The central story is that preserving the specific SAE-discovered directions is what stops forgetting and preserves robustness. Yet the experiments do not appear to include a clean ablation against a simpler L1 or sparsity penalty applied to the same activations. If those baselines perform similarly, then the SAE training step and the semantic interpretation become secondary rather than load-bearing. The assumption that the SAE features are precisely the ones that matter for shift robustness also sits on limited direct evidence so far. It is plausible, but showing that targeted edits to those features move robustness in the expected direction would make the argument tighter. This is the kind of paper that fits a reading group focused on practical multimodal adaptation or regularization techniques. Readers who care about text-free methods and some built-in explainability will get value from the method and the code. It is coherent enough on its own terms to deserve real referee time rather than a desk reject, even if the strongest claims will need more targeted controls in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SAE-FT, a method for robust fine-tuning of CLIP vision-language models. A sparse autoencoder is first trained on activations from the pre-trained visual encoder to discover semantically meaningful features. During downstream fine-tuning, a regularization term penalizes the addition or removal of these specific features in the visual representations. The approach is presented as computationally efficient, interpretable, and free of text guidance; it is claimed to match or exceed prior state-of-the-art results on ImageNet and its distribution-shift benchmarks while mitigating catastrophic forgetting.

Significance. If the empirical claims are substantiated, the work supplies a mechanistically transparent regularization strategy that operates solely on visual features and yields both robustness and interpretability gains. The public code release supports reproducibility. The method could serve as a practical alternative to text-guided approaches and as a tool for analyzing which directions in CLIP’s representation space are critical for out-of-distribution performance.

major comments (2)

[§3 (Method), Eq. (3)–(5)] §3 (Method), Eq. (3)–(5): the central claim attributes robustness preservation to the specific semantic features discovered by the pre-trained SAE. However, no ablation is reported that replaces the SAE-derived mask with either random directions or a generic L1/sparsity penalty on the same activations. Without this control experiment it is impossible to determine whether the SAE training step is load-bearing or whether any sufficiently strong activation regularizer would produce the reported ImageNet and shift results.
[§4 (Experiments)] §4 (Experiments): the abstract and results assert matching or exceeding SOTA performance on ImageNet and distribution-shift suites, yet the manuscript supplies neither error bars across multiple random seeds nor statistical significance tests for the reported gains. Given the known sensitivity of CLIP fine-tuning to hyperparameters and initialization, these omissions weaken confidence that the observed improvements are reliable and attributable to the proposed regularization.

minor comments (2)

[§2–3] Notation for the SAE reconstruction loss and the fine-tuning regularizer should be unified across sections to avoid confusion between the pre-training and fine-tuning stages.
[Figure 2] Figure 2 (feature visualization) would benefit from a side-by-side comparison with a non-SAE baseline to illustrate the claimed semantic specificity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation that we address below. We have prepared revisions to strengthen the paper accordingly.

read point-by-point responses

Referee: [§3 (Method), Eq. (3)–(5)] §3 (Method), Eq. (3)–(5): the central claim attributes robustness preservation to the specific semantic features discovered by the pre-trained SAE. However, no ablation is reported that replaces the SAE-derived mask with either random directions or a generic L1/sparsity penalty on the same activations. Without this control experiment it is impossible to determine whether the SAE training step is load-bearing or whether any sufficiently strong activation regularizer would produce the reported ImageNet and shift results.

Authors: We agree that a direct ablation replacing the SAE-derived mask with random directions or a generic L1 penalty on the same activations would more conclusively demonstrate that the semantic features identified by the pre-trained SAE are load-bearing for the observed robustness. The current manuscript emphasizes the interpretability benefits and shows that the discovered features align with semantic concepts via qualitative inspection, but does not include these specific controls. We will add the requested ablation experiments to the revised version, comparing performance when using random masks versus the SAE mask and a standard sparsity penalty, to isolate the contribution of the SAE training step. revision: yes
Referee: [§4 (Experiments)] §4 (Experiments): the abstract and results assert matching or exceeding SOTA performance on ImageNet and distribution-shift suites, yet the manuscript supplies neither error bars across multiple random seeds nor statistical significance tests for the reported gains. Given the known sensitivity of CLIP fine-tuning to hyperparameters and initialization, these omissions weaken confidence that the observed improvements are reliable and attributable to the proposed regularization.

Authors: We recognize that the absence of error bars and statistical significance testing limits the strength of the empirical claims, particularly given the known variability in CLIP fine-tuning. The reported results reflect single-run evaluations performed under the computational constraints at submission time. In the revised manuscript we will rerun the key experiments across multiple random seeds, report means with standard deviations as error bars, and include statistical significance tests comparing SAE-FT to the baselines to better substantiate the performance improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: SAE trained externally on pre-trained model and used as independent regularizer

full rationale

The method trains a Sparse Autoencoder on the frozen pre-trained CLIP visual representations to identify features, then applies a penalty on changes to those features during fine-tuning. This separation means the regularization term is defined from an independent pre-training stage rather than being fitted to or defined by the fine-tuning outcomes themselves. No equations reduce the robustness gains to a self-referential fit, no uniqueness theorem is imported from self-citations to force the approach, and the central performance claims rest on empirical results on ImageNet and shift benchmarks rather than any derivation that collapses to its inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that sparse autoencoder features capture the relevant semantics for robustness.

pith-pipeline@v0.9.0 · 5697 in / 1111 out tokens · 47971 ms · 2026-05-20T18:28:59.977648+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Lresid :=||Δr−W_d(Δs)||²₂ ... Ladd :=λresid Lresid + λadd 1/p ∑(1−m_k)|s_ft_k|

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

[1]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Concerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayde McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. Towards mo...

work page 2023
[2]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

work page 2014
[3]

An analysis of single-layer networks in unsuper- vised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsuper- vised feature learning. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR

work page 2011
[4]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

work page 2009
[6]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022

work page 2022
[7]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

work page 2004
[8]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Finetune like you pretrain: Improved finetuning of zero-shot vision models

Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023

work page 2023
[10]

Causal interpretation of sparse autoencoder features in vision.arXiv preprint arXiv:2509.00749, 2025

Sangyu Han, Yearim Kim, and Nojun Kwak. Causal interpretation of sparse autoencoder features in vision.arXiv preprint arXiv:2509.00749, 2025

work page arXiv 2025
[11]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021
[12]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021
[13]

Steering CLIP’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, and Blake Aaron Richards. Steering CLIP’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

work page arXiv 2025
[14]

StarFT: Robust fine-tuning of zero-shot models via spuriosity alignment.arXiv preprint arXiv:2505.13232, 2025

Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, and Jinwoo Shin. StarFT: Robust fine-tuning of zero-shot models via spuriosity alignment.arXiv preprint arXiv:2505.13232, 2025. 14

work page arXiv 2025
[15]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017
[16]

Earnshaw, Imran S

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS...

work page 2021
[17]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InProceedings of the 36th International Conference on Machine Learning (ICML), pages 3519–3529, 2019

work page 2019
[18]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, pages 32–33, 2009

work page 2009
[19]

Fine-tuning can distort pretrained features and underperform out-of- distribution

Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine- tuning can distort pretrained features and underperform out-of-distribution.arXiv preprint arXiv:2202.10054, 2022

work page arXiv 2022
[20]

Sparse autoencoders reveal selec- tive remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo, and Steffen Schneider. Sparse autoencoders reveal selec- tive remapping of visual concepts during adaptation. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025

work page 2025
[21]

Context-aware robust fine-tuning.International Journal of Computer Vision, 132(5):1685–1700, 2024

Xiaofeng Mao, Yufeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, and Zhao Li. Context-aware robust fine-tuning.International Journal of Computer Vision, 132(5):1685–1700, 2024

work page 2024
[22]

Jishnu Mukhoti, Yarin Gal, Philip H. S. Torr, and Puneet K. Dokania. Fine-tuning can cripple your foundation model; preserving features may be the solution.arXiv preprint arXiv:2308.13320, 2024

work page arXiv 2024
[23]

Sparse autoencoder.CS294A Lecture notes, 72(2011):1–19, 2011

Andrew Ng et al. Sparse autoencoder.CS294A Lecture notes, 72(2011):1–19, 2011

work page 2011
[24]

Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

work page 2024
[25]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[26]

Do ImageNet classifiers generalize to ImageNet? InProceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? InProceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

work page 2019
[27]

Effective robustness against natural distribution shifts for models with different training data.Advances in Neural Information Processing Systems, 36:73543–73558, 2023

Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, and Yao Qin. Effective robustness against natural distribution shifts for models with different training data.Advances in Neural Information Processing Systems, 36:73543–73558, 2023

work page 2023
[28]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features....

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global represen- tations by penalizing local predictive power. InAdvances in Neural Information Processing Systems (NeurIPS), pages 10506–10518, 2019. 15

work page 2019
[30]

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022

work page 2022
[31]

Explicit inductive bias for transfer learning with convolutional networks

LI Xuhong, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. InInternational conference on machine learning, pages 2825–

work page
[32]

A CKA analysis details The similarity metric CKA, proposed by Kornblith et al

PMLR, 2018. A CKA analysis details The similarity metric CKA, proposed by Kornblith et al. [17], is a similarity metric that is invariant to orthogonal projections and isotropic scaling, but not to invertible linear functions. AssumingXandYare centered it holds true that: 1 (n−1) 2 tr(XX T Y Y T ) =∥cov(X T , Y T )∥2 F .(12) HSIC generalizes this to inner...

work page 2018

[1] [1]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Concerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayde McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. Towards mo...

work page 2023

[2] [2]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014

work page 2014

[3] [3]

An analysis of single-layer networks in unsuper- vised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsuper- vised feature learning. InProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 ofProceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR

work page 2011

[4] [4]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

work page 2009

[6] [6]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022

work page 2022

[7] [7]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004

work page 2004

[8] [8]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Finetune like you pretrain: Improved finetuning of zero-shot vision models

Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023

work page 2023

[10] [10]

Causal interpretation of sparse autoencoder features in vision.arXiv preprint arXiv:2509.00749, 2025

Sangyu Han, Yearim Kim, and Nojun Kwak. Causal interpretation of sparse autoencoder features in vision.arXiv preprint arXiv:2509.00749, 2025

work page arXiv 2025

[11] [11]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021

[12] [12]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021

[13] [13]

Steering CLIP’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, and Blake Aaron Richards. Steering CLIP’s vision transformer with sparse autoencoders.arXiv preprint arXiv:2504.08729, 2025

work page arXiv 2025

[14] [14]

StarFT: Robust fine-tuning of zero-shot models via spuriosity alignment.arXiv preprint arXiv:2505.13232, 2025

Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, and Jinwoo Shin. StarFT: Robust fine-tuning of zero-shot models via spuriosity alignment.arXiv preprint arXiv:2505.13232, 2025. 14

work page arXiv 2025

[15] [15]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017

[16] [16]

Earnshaw, Imran S

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS...

work page 2021

[17] [17]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InProceedings of the 36th International Conference on Machine Learning (ICML), pages 3519–3529, 2019

work page 2019

[18] [18]

Learning multiple layers of features from tiny images

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, pages 32–33, 2009

work page 2009

[19] [19]

Fine-tuning can distort pretrained features and underperform out-of- distribution

Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine- tuning can distort pretrained features and underperform out-of-distribution.arXiv preprint arXiv:2202.10054, 2022

work page arXiv 2022

[20] [20]

Sparse autoencoders reveal selec- tive remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo, and Steffen Schneider. Sparse autoencoders reveal selec- tive remapping of visual concepts during adaptation. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025

work page 2025

[21] [21]

Context-aware robust fine-tuning.International Journal of Computer Vision, 132(5):1685–1700, 2024

Xiaofeng Mao, Yufeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, and Zhao Li. Context-aware robust fine-tuning.International Journal of Computer Vision, 132(5):1685–1700, 2024

work page 2024

[22] [22]

Jishnu Mukhoti, Yarin Gal, Philip H. S. Torr, and Puneet K. Dokania. Fine-tuning can cripple your foundation model; preserving features may be the solution.arXiv preprint arXiv:2308.13320, 2024

work page arXiv 2024

[23] [23]

Sparse autoencoder.CS294A Lecture notes, 72(2011):1–19, 2011

Andrew Ng et al. Sparse autoencoder.CS294A Lecture notes, 72(2011):1–19, 2011

work page 2011

[24] [24]

Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. Towards calibrated robust fine-tuning of vision-language models.Advances in Neural Information Processing Systems, 37:12677–12707, 2024

work page 2024

[25] [25]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[26] [26]

Do ImageNet classifiers generalize to ImageNet? InProceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? InProceedings of the 36th International Conference on Machine Learning (ICML), pages 5389–5400, 2019

work page 2019

[27] [27]

Effective robustness against natural distribution shifts for models with different training data.Advances in Neural Information Processing Systems, 36:73543–73558, 2023

Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, and Yao Qin. Effective robustness against natural distribution shifts for models with different training data.Advances in Neural Information Processing Systems, 36:73543–73558, 2023

work page 2023

[28] [28]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision- language encoders with improved semantic understanding, localization, and dense features....

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global represen- tations by penalizing local predictive power. InAdvances in Neural Information Processing Systems (NeurIPS), pages 10506–10518, 2019. 15

work page 2019

[30] [30]

Robust fine-tuning of zero-shot models

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959–7971, 2022

work page 2022

[31] [31]

Explicit inductive bias for transfer learning with convolutional networks

LI Xuhong, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. InInternational conference on machine learning, pages 2825–

work page

[32] [32]

A CKA analysis details The similarity metric CKA, proposed by Kornblith et al

PMLR, 2018. A CKA analysis details The similarity metric CKA, proposed by Kornblith et al. [17], is a similarity metric that is invariant to orthogonal projections and isotropic scaling, but not to invertible linear functions. AssumingXandYare centered it holds true that: 1 (n−1) 2 tr(XX T Y Y T ) =∥cov(X T , Y T )∥2 F .(12) HSIC generalizes this to inner...

work page 2018