pith. machine review for the scientific record.

arxiv: 2604.05724 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords sparse autoencoders · CLIP vision encoders · information scope · interpretability · contextual dependency score · feature analysis · vision transformers

The pith

Sparse autoencoders for CLIP separate features by whether they capture local patch information or global image signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that information scope, in addition to semantic meaning, is an important property of features learned by sparse autoencoders on CLIP vision encoders. Some features respond consistently under small spatial perturbations of the image, pointing to a local, patch-specific scope, while others shift unpredictably with minor input changes, pointing to a global, image-level scope. By introducing the Contextual Dependency Score, the work shows these scope differences cause distinct effects on what the model predicts and how confident it is. Understanding this helps diagnose why certain features contribute to decisions in ways that go beyond their apparent meaning.

Core claim

The authors claim that SAE features in CLIP can be characterized by their information scope, defined as the breadth of visual evidence they aggregate, and that this scope can be quantified with the Contextual Dependency Score, which measures response consistency under spatial perturbations. This score separates positionally stable, local-scope features from positionally variant, global-scope features, and each type is claimed to influence CLIP's predictions in a systematically different way.

What carries the argument

The Contextual Dependency Score, which measures how much a feature's activation map changes when the input image is spatially shifted, classifying the feature's scope as local (positionally stable) or global (positionally variant).
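As a concreteness aid, here is a minimal sketch of how such a score could be computed, following the description in Figure 2 (Earth Mover's Distance between the normalized activation maps of two crops shifted by one patch, averaged over representative images). The patch grid size, the normalization, and the use of the POT library are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of a Contextual Dependency Score, assuming the
    # EMD-between-shifted-crops procedure described in Figure 2.
    import numpy as np
    import ot  # Python Optimal Transport (POT); an assumed dependency

    def cds_for_feature(maps_crop_a, maps_crop_b, grid=14):
        """maps_crop_*: lists of (grid*grid,) activation vectors for one SAE
        feature, one entry per representative image (k_CDS entries)."""
        # Ground cost: Euclidean distance between patch-grid coordinates.
        coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
        cost = ot.dist(coords, coords, metric="euclidean")

        emds = []
        for a, b in zip(maps_crop_a, maps_crop_b):
            a, b = np.clip(a, 0, None), np.clip(b, 0, None)
            if a.sum() == 0 or b.sum() == 0:
                continue  # feature inactive on this pair; skip it
            # Normalize each map to a distribution, then take the transport cost.
            emds.append(ot.emd2(a / a.sum(), b / b.sum(), cost))
        return float(np.mean(emds)) if emds else float("nan")

On this reading, a low average EMD marks a positionally stable (low-CDS, local-scope) feature, while a high value marks a positionally variant (high-CDS, global-scope) feature.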

If this is right

  • Local-scope features maintain consistent responses across positional changes in the image.
  • Global-scope features exhibit unpredictable shifts in activation with minor input perturbations.
  • Features of different scopes produce systematically different effects on the model's classification predictions and confidence scores (one way to probe this is sketched after this list).
  • The scope distinction supplies a diagnostic axis for examining SAE-derived features beyond their semantic content.
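One way to probe the third point is a feature-group removal test in the spirit of the removal analyses shown in Figures 5 and 8: zero out one CDS-defined group in the SAE code, decode, and compare classifier confidence. The sketch below is only illustrative; sae, clf_head, and the tensor shapes are hypothetical placeholders, not the authors' code.

    # Hypothetical ablation probe: remove one scope group from the SAE
    # reconstruction and measure the change in classifier confidence.
    import torch

    @torch.no_grad()
    def confidence_shift(embeddings, sae, clf_head, feature_ids):
        """embeddings: (batch, d_model) CLIP features; feature_ids: indices of
        the low-CDS or high-CDS feature group to remove."""
        z = sae.encode(embeddings)             # (batch, n_features) sparse codes
        baseline = clf_head(sae.decode(z)).softmax(-1)

        z_ablated = z.clone()
        z_ablated[:, feature_ids] = 0.0        # zero out one scope group
        ablated = clf_head(sae.decode(z_ablated)).softmax(-1)

        # Average top-1 confidence drop and fraction of label flips.
        conf_drop = (baseline.max(-1).values - ablated.max(-1).values).mean()
        flips = (baseline.argmax(-1) != ablated.argmax(-1)).float().mean()
        return conf_drop.item(), flips.item()

Systematically different confidence-drop and label-flip profiles for the two groups would support the claim; indistinguishable profiles would weaken it.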

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scope analysis could guide feature selection in interpretability methods for other vision encoders.
  • Testing scope on downstream tasks might reveal whether it affects generalization or transfer performance.
  • Pairing scope measurements with semantic labels could produce more complete maps of what representations encode.

Load-bearing premise

That the consistency of a feature's response to spatial perturbations directly reflects its information scope rather than other factors like robustness or sparsity patterns.

What would settle it

A test where images are manipulated to move objects while keeping semantics fixed, checking if features labeled local remain stable in activation and global ones do not.
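A minimal sketch of that test, under strong assumptions: composite the same object at several positions on a fixed background and track where and how strongly a given SAE feature fires. The helpers encode_patches and sae, and the PIL compositing, are hypothetical stand-ins rather than anything from the paper.

    # Hypothetical object-translation probe: paste the same object at several
    # positions, keep semantics fixed, and track one feature's peak response.
    # Assumes encode_patches and sae return numpy-compatible arrays.
    import numpy as np
    from PIL import Image

    def translated_peaks(background, obj, sae, encode_patches, positions, feat):
        """positions: list of (x, y) paste locations. Returns (peak_patch,
        peak_value) of SAE feature `feat` for each composite image."""
        peaks = []
        for x, y in positions:
            img = background.copy()
            img.paste(obj, (x, y))                               # move the object only
            acts = np.asarray(sae.encode(encode_patches(img)))   # (n_patches, n_features)
            a = acts[:, feat]
            peaks.append((int(np.argmax(a)), float(a.max())))
        return peaks

A feature labeled local should keep a comparable peak value while its peak patch tracks the object; a feature labeled global should not show that stability.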

Figures

Figures reproduced from arXiv: 2604.05724 by Jaehyun Choi, Junmo Kim, Yusung Ro.

Figure 1. Spatial instability of CLIP outlier tokens under Shifted Context Cropping (SCC). We analyze the spatial consistency of patch-token L2 norms using two overlapping crops (red: top-left; blue: bottom-right) translated by a single patch. The heatmaps visualize patch-token L2 norms for each crop across different CLIP ViT models. The green and yellow spots, indicating outlier tokens, do not maintain consistent s… view at source ↗

Figure 2. Framework for quantifying our Contextual Dependency Score (CDS) of SAE features. We evaluate the contextual dependency of SAE features using CDS. This score is derived from the Earth Mover's Distance (EMD) computed between the normalized activation maps of two shifted input crops (red and blue) as shown for a single pair. The final CDS for a feature is the average of these EMD scores over k_CDS representat… view at source ↗

Figure 3. Distribution of Contextual Dependency Score (CDS). Histogram of CDS values for SAE features across three CLIP models, showing a dominant low-CDS peak together with distinct peaks in the high-CDS region. This distribution motivates partitioning SAE features into low-CDS and high-CDS groups. view at source ↗

Figure 4. Representative activation patterns of low-CDS and high-CDS SAE features. Low-CDS features (left) show spatially grounded and semantically consistent responses, whereas high-CDS features (right) exhibit more diffuse activations and weaker semantic consistency. view at source ↗

Figure 5. Visualization of per-patch L2 norms for the original and feature-removed embeddings. We visualize the per-patch L2 norms of the original embedding, the low-CDS-removed embedding, and the high-CDS-removed embedding. The comparison highlights how each feature group affects outlier-token norms and the overall spatial distribution of patch norms across the three CLIP models. view at source ↗

Figure 6. Distribution of Contextual Dependency Score (CDS). Histogram of CDS values for SAE features across DINOv2, SigLIP2, and DeiT3. view at source ↗

Figure 7. Contextual dependency across the norm spectrum. We plot the Activation-Weighted CDS (awCDS) across the token norm spectrum. Across all architectures, awCDS spikes exponentially for outlier tokens. view at source ↗

Figure 8. Linear probe accuracy across ViT scales on ImageNet-1K. The high-CDS-removed embedding (blue) consistently outperforms the Baseline (green) across all architectures, indicating that local signals are more discriminative for classification. Conversely, the performance gap of the low-CDS-removed embedding (red) generally narrows as model size increases, suggesting that larger models encode broader context m… view at source ↗

Figure 9. Visualization of low-CDS features. These features exhibit strong spatial grounding, consistently localizing specific visual concepts across different images. view at source ↗

Figure 10. Visualization of high-CDS features. These features exhibit diffuse activation patterns and capture broader contextual information rather than localized visual details. view at source ↗
read the original abstract

Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP's predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces 'information scope' as a complementary interpretability axis for SAE features in CLIP vision encoders, beyond semantic content. It observes that some features respond consistently to spatial perturbations (indicating local, patch-specific scope) while others vary (indicating global scope), and proposes the Contextual Dependency Score (CDS) to quantify and separate these. Experiments claim to show that features of differing scopes exert systematically different influences on CLIP predictions and confidence.

Significance. If validated with controls, the work could add a useful diagnostic dimension for analyzing how CLIP aggregates visual evidence in SAE decompositions. The attempt to move beyond pure semantics is timely, but the current presentation leaves the core distinction empirically underspecified.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'features of different information scopes exert systematically different influences on CLIP's predictions' is stated without any quantitative results, controls, statistical tests, or implementation details for CDS, leaving the empirical support unverifiable.
  2. [Abstract / CDS definition] CDS definition (abstract and method): the interpretation that positional stability under spatial perturbations indicates local scope while inconsistency indicates global scope is not secured against confounds such as differences in feature robustness to small changes or activation sparsity patterns. These alternatives could produce the observed consistency split without any difference in the breadth of visual evidence aggregated, undermining the claim that information scope is a distinct axis.
minor comments (2)
  1. The manuscript should include the explicit formula for CDS, any pseudocode for the perturbation procedure, and details on the SAE training and CLIP model used.
  2. Clarify whether 'systematic differences in influence' are measured via ablation, attribution, or another method, and report effect sizes or confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our empirical claims and the robustness of the Contextual Dependency Score definition. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'features of different information scopes exert systematically different influences on CLIP's predictions' is stated without any quantitative results, controls, statistical tests, or implementation details for CDS, leaving the empirical support unverifiable.

    Authors: We agree that the abstract would be strengthened by including quantitative details to make the central claim more verifiable on first reading. In the revised manuscript we have updated the abstract to report key quantitative results from our experiments (including average differences in prediction influence and associated statistical tests) along with a brief note on CDS computation. Full implementation details, controls, and all statistical analyses remain in the Methods and Experiments sections. revision: yes

  2. Referee: [Abstract / CDS definition] CDS definition (abstract and method): the interpretation that positional stability under spatial perturbations indicates local scope while inconsistency indicates global scope is not secured against confounds such as differences in feature robustness to small changes or activation sparsity patterns. These alternatives could produce the observed consistency split without any difference in the breadth of visual evidence aggregated, undermining the claim that information scope is a distinct axis.

    Authors: This concern about alternative explanations is well-taken. While the core experiments demonstrate that scope-differentiated features produce systematically different effects on CLIP predictions and confidence (effects that are not obviously predicted by robustness or sparsity differences alone), we acknowledge that the CDS definition itself does not explicitly rule out these confounds. In the revision we have added a dedicated discussion of these alternatives and included post-hoc controls that match features on sparsity; the positional-stability distinction and its downstream effects on model behavior persist under these controls. We have also noted the limits of current controls as an area for future work (a minimal sketch of one such sparsity-matched control appears just below). revision: partial
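For illustration only, a sparsity-matched control of the kind described above could look like the following; the binning scheme and the per-feature arrays (act_freq, cds, effect) are assumptions, not the authors' actual analysis.

    # Hypothetical sparsity-matched control: bin features by activation
    # frequency, then check the high-vs-low CDS effect gap within each bin.
    import numpy as np

    def sparsity_matched_gap(act_freq, cds, effect, n_bins=10):
        """act_freq, cds, effect: per-feature arrays (activation frequency, CDS,
        measured influence on predictions). Returns the mean within-bin gap."""
        edges = np.quantile(act_freq, np.linspace(0.0, 1.0, n_bins + 1))
        bin_idx = np.digitize(act_freq, edges[1:-1])   # values in 0..n_bins-1
        gaps = []
        for b in range(n_bins):
            mask = bin_idx == b
            if mask.sum() < 2:
                continue
            split = np.median(cds[mask])
            high = effect[mask & (cds > split)]
            low = effect[mask & (cds <= split)]
            if len(high) and len(low):
                gaps.append(high.mean() - low.mean())
        return float(np.mean(gaps)) if gaps else float("nan")

If the gap survives this matching, sparsity alone is unlikely to explain the scope split.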

Circularity Check

0 steps flagged

No significant circularity; CDS is an explicit new definition with independent experimental follow-up

full rationale

The paper defines information scope via direct observation of SAE feature responses to spatial perturbations and introduces CDS as a quantitative separator based on that observed consistency/variance. This is a self-contained definitional step rather than a reduction of any claimed prediction or result to a fitted parameter, prior self-citation, or ansatz. Subsequent claims about differential influence on CLIP predictions are presented as separate empirical measurements, not forced by the CDS construction itself. No equations, uniqueness theorems, or self-citation chains are shown to render the central axis equivalent to its inputs by construction. The derivation remains independent of the specific interpretive mapping from consistency to scope.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the untested premise that positional stability under perturbations equates to local information scope. No free parameters are described. Two new entities are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption: Positional stability of SAE feature responses across spatial perturbations indicates local scope, while instability indicates global scope.
    Invoked to justify the separation performed by CDS.
invented entities (2)
  • information scope · no independent evidence
    purpose: A complementary dimension to semantic meaning that characterizes how broadly an SAE feature aggregates visual evidence.
    New interpretability axis introduced in the abstract.
  • Contextual Dependency Score (CDS) · no independent evidence
    purpose: Metric that separates positionally stable local-scope features from positionally variant global-scope features.
    Proposed quantification method.

pith-pipeline@v0.9.0 · 5457 in / 1372 out tokens · 72247 ms · 2026-05-10T19:05:40.434924+00:00 · methodology

