pith. machine review for the scientific record.

arxiv: 2604.05724 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords sparse autoencoders · CLIP vision encoders · information scope · interpretability · contextual dependency score · feature analysis · vision transformers

The pith

Sparse autoencoders for CLIP separate features by whether they capture local patch information or global image signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that information scope, in addition to semantic meaning, is an important property of features learned by sparse autoencoders on CLIP vision encoders. Some features respond consistently under small spatial perturbations of the image, pointing to a local, patch-specific scope, while others shift unpredictably with minor input changes, pointing to a global, image-level scope. By introducing the Contextual Dependency Score, the work shows these scope differences cause distinct effects on what the model predicts and how confident it is. Understanding this helps diagnose why certain features contribute to decisions in ways that go beyond their apparent meaning.

Core claim

The authors claim that SAE features in CLIP can be characterized by their information scope, defined as the breadth of visual evidence they aggregate, and that this scope can be quantified with the Contextual Dependency Score, which measures response consistency under spatial perturbations. This score separates positionally stable, local-scope features from positionally variant, global-scope features, and each type is claimed to influence CLIP's predictions in a systematically different way.

What carries the argument

The Contextual Dependency Score, which measures how much a feature's activation map changes when the input image is spatially shifted, classifying the feature's scope as local (positionally stable) or global (positionally variant).
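As a concreteness aid, here is a minimal sketch of how such a score could be computed, following the description in Figure 2 (Earth Mover's Distance between the normalized activation maps of two crops shifted by one patch, averaged over representative images). The patch grid size, the normalization, and the use of the POT library are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of a Contextual Dependency Score, assuming the
    # EMD-between-shifted-crops procedure described in Figure 2.
    import numpy as np
    import ot  # Python Optimal Transport (POT); an assumed dependency

    def cds_for_feature(maps_crop_a, maps_crop_b, grid=14):
        """maps_crop_*: lists of (grid*grid,) activation vectors for one SAE
        feature, one entry per representative image (k_CDS entries)."""
        # Ground cost: Euclidean distance between patch-grid coordinates.
        coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
        cost = ot.dist(coords, coords, metric="euclidean")

        emds = []
        for a, b in zip(maps_crop_a, maps_crop_b):
            a, b = np.clip(a, 0, None), np.clip(b, 0, None)
            if a.sum() == 0 or b.sum() == 0:
                continue  # feature inactive on this pair; skip it
            # Normalize each map to a distribution, then take the transport cost.
            emds.append(ot.emd2(a / a.sum(), b / b.sum(), cost))
        return float(np.mean(emds)) if emds else float("nan")

On this reading, a low average EMD marks a positionally stable (low-CDS, local-scope) feature, while a high value marks a positionally variant (high-CDS, global-scope) feature.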

If this is right

  • Local-scope features maintain consistent responses across positional changes in the image.
  • Global-scope features exhibit unpredictable shifts in activation with minor input perturbations.
  • Features of different scopes produce systematically different effects on the model's classification predictions and confidence scores (one way to probe this is sketched after this list).
  • The scope distinction supplies a diagnostic axis for examining SAE-derived features beyond their semantic content.
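One way to probe the third point is a feature-group removal test in the spirit of the removal analyses shown in Figures 5 and 8: zero out one CDS-defined group in the SAE code, decode, and compare classifier confidence. The sketch below is only illustrative; sae, clf_head, and the tensor shapes are hypothetical placeholders, not the authors' code.

    # Hypothetical ablation probe: remove one scope group from the SAE
    # reconstruction and measure the change in classifier confidence.
    import torch

    @torch.no_grad()
    def confidence_shift(embeddings, sae, clf_head, feature_ids):
        """embeddings: (batch, d_model) CLIP features; feature_ids: indices of
        the low-CDS or high-CDS feature group to remove."""
        z = sae.encode(embeddings)             # (batch, n_features) sparse codes
        baseline = clf_head(sae.decode(z)).softmax(-1)

        z_ablated = z.clone()
        z_ablated[:, feature_ids] = 0.0        # zero out one scope group
        ablated = clf_head(sae.decode(z_ablated)).softmax(-1)

        # Average top-1 confidence drop and fraction of label flips.
        conf_drop = (baseline.max(-1).values - ablated.max(-1).values).mean()
        flips = (baseline.argmax(-1) != ablated.argmax(-1)).float().mean()
        return conf_drop.item(), flips.item()

Systematically different confidence-drop and label-flip profiles for the two groups would support the claim; indistinguishable profiles would weaken it.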

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scope analysis could guide feature selection in interpretability methods for other vision encoders.
  • Testing scope on downstream tasks might reveal whether it affects generalization or transfer performance.
  • Pairing scope measurements with semantic labels could produce more complete maps of what representations encode.

Load-bearing premise

That the consistency of a feature's response to spatial perturbations directly reflects its information scope rather than other factors like robustness or sparsity patterns.

What would settle it

A test where images are manipulated to move objects while keeping semantics fixed, checking if features labeled local remain stable in activation and global ones do not.
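A minimal sketch of that test, under strong assumptions: composite the same object at several positions on a fixed background and track where and how strongly a given SAE feature fires. The helpers encode_patches and sae, and the PIL compositing, are hypothetical stand-ins rather than anything from the paper.

    # Hypothetical object-translation probe: paste the same object at several
    # positions, keep semantics fixed, and track one feature's peak response.
    # Assumes encode_patches and sae return numpy-compatible arrays.
    import numpy as np
    from PIL import Image

    def translated_peaks(background, obj, sae, encode_patches, positions, feat):
        """positions: list of (x, y) paste locations. Returns (peak_patch,
        peak_value) of SAE feature `feat` for each composite image."""
        peaks = []
        for x, y in positions:
            img = background.copy()
            img.paste(obj, (x, y))                               # move the object only
            acts = np.asarray(sae.encode(encode_patches(img)))   # (n_patches, n_features)
            a = acts[:, feat]
            peaks.append((int(np.argmax(a)), float(a.max())))
        return peaks

A feature labeled local should keep a comparable peak value while its peak patch tracks the object; a feature labeled global should not show that stability.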

Figures

Figures reproduced from arXiv: 2604.05724 by Jaehyun Choi, Junmo Kim, Yusung Ro.

Figure 1. Spatial instability of CLIP outlier tokens under Shifted Context Cropping (SCC). We analyze the spatial consistency of patch-token L2 norms using two overlapping crops (red: top-left; blue: bottom-right) translated by a single patch. The heatmaps visualize patch-token L2 norms for each crop across different CLIP ViT models. The green and yellow spots, indicating outlier tokens, do not maintain consistent s… view at source ↗

Figure 2. Framework for quantifying our Contextual Dependency Score (CDS) of SAE features. We evaluate the contextual dependency of SAE features using CDS. This score is derived from the Earth Mover's Distance (EMD) computed between the normalized activation maps of two shifted input crops (red and blue) as shown for a single pair. The final CDS for a feature is the average of these EMD scores over k_CDS representat… view at source ↗

Figure 3. Distribution of Contextual Dependency Score (CDS). Histogram of CDS values for SAE features across three CLIP models, showing a dominant low-CDS peak together with distinct peaks in the high-CDS region. This distribution motivates partitioning SAE features into low-CDS and high-CDS groups. view at source ↗

Figure 4. Representative activation patterns of low-CDS and high-CDS SAE features. Low-CDS features (left) show spatially grounded and semantically consistent responses, whereas high-CDS features (right) exhibit more diffuse activations and weaker semantic consistency. view at source ↗

Figure 5. Visualization of per-patch L2 norms for the original and feature-removed embeddings. We visualize the per-patch L2 norms of the original embedding, the low-CDS-removed embedding, and the high-CDS-removed embedding. The comparison highlights how each feature group affects outlier-token norms and the overall spatial distribution of patch norms across the three CLIP models. view at source ↗

Figure 6. Distribution of Contextual Dependency Score (CDS). Histogram of CDS values for SAE features across DINOv2, SigLIP2, and DeiT3. view at source ↗

Figure 7. Contextual dependency across the norm spectrum. We plot the Activation-Weighted CDS (awCDS) across the token norm spectrum. Across all architectures, awCDS spikes exponentially for outlier tokens. view at source ↗

Figure 8. Linear probe accuracy across ViT scales on ImageNet-1K. The high-CDS-removed embedding (blue) consistently outperforms the Baseline (green) across all architectures, indicating that local signals are more discriminative for classification. Conversely, the performance gap of the low-CDS-removed embedding (red) generally narrows as model size increases, suggesting that larger models encode broader context m… view at source ↗

Figure 9. Visualization of low-CDS features. These features exhibit strong spatial grounding, consistently localizing specific visual concepts across different images. view at source ↗

Figure 10. Visualization of high-CDS features. These features exhibit diffuse activation patterns and capture broader contextual information rather than localized visual details. view at source ↗
read the original abstract

Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP's predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces 'information scope' as a complementary interpretability axis for SAE features in CLIP vision encoders, beyond semantic content. It observes that some features respond consistently to spatial perturbations (indicating local, patch-specific scope) while others vary (indicating global scope), and proposes the Contextual Dependency Score (CDS) to quantify and separate these. Experiments claim to show that features of differing scopes exert systematically different influences on CLIP predictions and confidence.

Significance. If validated with controls, the work could add a useful diagnostic dimension for analyzing how CLIP aggregates visual evidence in SAE decompositions. The attempt to move beyond pure semantics is timely, but the current presentation leaves the core distinction empirically underspecified.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'features of different information scopes exert systematically different influences on CLIP's predictions' is stated without any quantitative results, controls, statistical tests, or implementation details for CDS, leaving the empirical support unverifiable.
  2. [Abstract / CDS definition] CDS definition (abstract and method): the interpretation that positional stability under spatial perturbations indicates local scope while inconsistency indicates global scope is not secured against confounds such as differences in feature robustness to small changes or activation sparsity patterns. These alternatives could produce the observed consistency split without any difference in the breadth of visual evidence aggregated, undermining the claim that information scope is a distinct axis.
minor comments (2)
  1. The manuscript should include the explicit formula for CDS, any pseudocode for the perturbation procedure, and details on the SAE training and CLIP model used.
  2. Clarify whether 'systematic differences in influence' are measured via ablation, attribution, or another method, and report effect sizes or confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our empirical claims and the robustness of the Contextual Dependency Score definition. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'features of different information scopes exert systematically different influences on CLIP's predictions' is stated without any quantitative results, controls, statistical tests, or implementation details for CDS, leaving the empirical support unverifiable.

    Authors: We agree that the abstract would be strengthened by including quantitative details to make the central claim more verifiable on first reading. In the revised manuscript we have updated the abstract to report key quantitative results from our experiments (including average differences in prediction influence and associated statistical tests) along with a brief note on CDS computation. Full implementation details, controls, and all statistical analyses remain in the Methods and Experiments sections. revision: yes

  2. Referee: [Abstract / CDS definition] CDS definition (abstract and method): the interpretation that positional stability under spatial perturbations indicates local scope while inconsistency indicates global scope is not secured against confounds such as differences in feature robustness to small changes or activation sparsity patterns. These alternatives could produce the observed consistency split without any difference in the breadth of visual evidence aggregated, undermining the claim that information scope is a distinct axis.

    Authors: This concern about alternative explanations is well-taken. While the core experiments demonstrate that scope-differentiated features produce systematically different effects on CLIP predictions and confidence (effects that are not obviously predicted by robustness or sparsity differences alone), we acknowledge that the CDS definition itself does not explicitly rule out these confounds. In the revision we have added a dedicated discussion of these alternatives and included post-hoc controls that match features on sparsity; the positional-stability distinction and its downstream effects on model behavior persist under these controls. We have also noted the limits of current controls as an area for future work (a minimal sketch of one such sparsity-matched control appears just below). revision: partial
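For illustration only, a sparsity-matched control of the kind described above could look like the following; the binning scheme and the per-feature arrays (act_freq, cds, effect) are assumptions, not the authors' actual analysis.

    # Hypothetical sparsity-matched control: bin features by activation
    # frequency, then check the high-vs-low CDS effect gap within each bin.
    import numpy as np

    def sparsity_matched_gap(act_freq, cds, effect, n_bins=10):
        """act_freq, cds, effect: per-feature arrays (activation frequency, CDS,
        measured influence on predictions). Returns the mean within-bin gap."""
        edges = np.quantile(act_freq, np.linspace(0.0, 1.0, n_bins + 1))
        bin_idx = np.digitize(act_freq, edges[1:-1])   # values in 0..n_bins-1
        gaps = []
        for b in range(n_bins):
            mask = bin_idx == b
            if mask.sum() < 2:
                continue
            split = np.median(cds[mask])
            high = effect[mask & (cds > split)]
            low = effect[mask & (cds <= split)]
            if len(high) and len(low):
                gaps.append(high.mean() - low.mean())
        return float(np.mean(gaps)) if gaps else float("nan")

If the gap survives this matching, sparsity alone is unlikely to explain the scope split.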

Circularity Check

0 steps flagged

No significant circularity; CDS is an explicit new definition with independent experimental follow-up

full rationale

The paper defines information scope via direct observation of SAE feature responses to spatial perturbations and introduces CDS as a quantitative separator based on that observed consistency/variance. This is a self-contained definitional step rather than a reduction of any claimed prediction or result to a fitted parameter, prior self-citation, or ansatz. Subsequent claims about differential influence on CLIP predictions are presented as separate empirical measurements, not forced by the CDS construction itself. No equations, uniqueness theorems, or self-citation chains are shown to render the central axis equivalent to its inputs by construction. The derivation remains independent of the specific interpretive mapping from consistency to scope.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the untested premise that positional stability under perturbations equates to local information scope. No free parameters are described. Two new entities are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption: Positional stability of SAE feature responses across spatial perturbations indicates local scope, while instability indicates global scope.
    Invoked to justify the separation performed by CDS.
invented entities (2)
  • information scope · no independent evidence
    purpose: A complementary dimension to semantic meaning that characterizes how broadly an SAE feature aggregates visual evidence.
    New interpretability axis introduced in the abstract.
  • Contextual Dependency Score (CDS) · no independent evidence
    purpose: Metric that separates positionally stable local-scope features from positionally variant global-scope features.
    Proposed quantification method.

pith-pipeline@v0.9.0 · 5457 in / 1372 out tokens · 72247 ms · 2026-05-10T19:05:40.434924+00:00 · methodology

