Towards Robustness against Typographic Attack with Training-free Concept Localization

Aidong Zhang; Bohan Liu; Guangzhi Xiong; Sanchit Sinha; Wenqian Ye; Zhenghao He

arxiv: 2607.02494 · v1 · pith:OHB5BAZ2new · submitted 2026-07-02 · 💻 cs.CV · cs.CL

Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang

Pith reviewed 2026-07-03 14:39 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords typographic attack · vision transformer · attention heads · mechanistic interpretability · CLIP robustness · training-free intervention · circuit mining · concept localization


The pith

Specific attention heads in vision transformers encode the lexical bias behind typographic attacks and can be adjusted directly to restore visual focus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that typographic attacks arise because certain attention heads in CLIP's Vision Transformer prioritize text semantics over image content. Using sampling-based attribution to score heads and probabilistic analysis to mine the responsible circuits, the authors isolate those components without any retraining. Direct interventions such as selective attention-weight adjustment on the identified heads then raise classification accuracy on attacked images and improve downstream VQA performance in large vision-language models. A reader would care because the fix is training-free, interpretable, and outperforms existing defense baselines on standard benchmarks.

Core claim

The central claim has two parts. First, sampling-based interpretations of hidden states, followed by quantitative attribution of semantic versus lexical focus and probabilistic circuit mining, isolate particular ViT attention heads that disproportionately encode lexical information. Second, simple interventions applied to those heads, such as selective adjustment of attention weights, substantially improve robustness to typographic attacks in object classification and yield measurable gains on RIO-Bench VQA tasks under attack when applied to the vision encoders of state-of-the-art LVLMs.
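As a rough illustration of what a head-level intervention could look like, the NumPy sketch below rescales the output of selected attention heads in a toy multi-head self-attention layer. It is a stand-in under stated assumptions, not the authors' implementation: the weight layout, the `head_scale` vector, the `scales_from_scores` helper, and the thresholding rule are all hypothetical, and the paper's actual adjustment of attention weights may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, head_scale=None):
    """Toy multi-head self-attention with a per-head output multiplier.

    x: (tokens, d_model)
    w_q, w_k, w_v: (heads, d_model, d_head)
    w_o: (heads, d_head, d_model)
    head_scale: (heads,) multipliers; 1.0 leaves a head untouched (hypothetical knob).
    """
    n_heads, _, d_head = w_q.shape
    if head_scale is None:
        head_scale = np.ones(n_heads)
    out = np.zeros((x.shape[0], w_o.shape[2]))
    for h in range(n_heads):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))   # (tokens, tokens)
        out += head_scale[h] * (attn @ v @ w_o[h])  # scaled contribution of head h
    return out

def scales_from_scores(lexical_scores, threshold, damp=0.0):
    # Heads whose (hypothetical) lexical score exceeds the threshold are damped.
    scores = np.asarray(lexical_scores, dtype=float)
    return np.where(scores > threshold, damp, 1.0)

# Toy usage with random weights; the scores below are made up for illustration.
rng = np.random.default_rng(0)
tokens, d_model, heads, d_head = 5, 8, 4, 2
x = rng.normal(size=(tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(heads, d_head, d_model))

scale = scales_from_scores([0.9, 0.1, 0.2, 0.7], threshold=0.5)  # damp heads 0 and 3
baseline = multi_head_attention(x, w_q, w_k, w_v, w_o)
patched = multi_head_attention(x, w_q, w_k, w_v, w_o, head_scale=scale)
```

Because the per-head contributions are summed, setting a multiplier to zero is equivalent to removing that head from the layer, which is why no gradient update or retraining is needed for this kind of edit.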

What carries the argument

Sampling-based attribution combined with probabilistic circuit mining that isolates individual ViT attention heads encoding lexical information.
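To make the idea of scoring heads by lexical focus concrete, here is a minimal sketch that ranks heads by the share of their attribution mass falling on text-region patches. This is an illustrative proxy only: it is not the paper's nTAS definition, and the function names, the per-head attribution array, and the `text_mask` input are assumptions introduced for the example.

```python
import numpy as np

def text_focus_share(attribution, text_mask):
    """Fraction of non-negative attribution mass that lands on text patches.

    attribution: (patches,) attribution scores for one head (hypothetical input)
    text_mask:   (patches,) booleans, True where a patch contains rendered text
    """
    a = np.clip(np.asarray(attribution, dtype=float), 0.0, None)
    total = a.sum()
    if total == 0.0:
        return 0.0
    return float(a[np.asarray(text_mask, dtype=bool)].sum() / total)

def rank_heads_by_text_focus(head_attributions, text_mask):
    """Return head indices sorted from most to least text-focused, plus the shares."""
    shares = np.array([text_focus_share(a, text_mask) for a in head_attributions])
    order = np.argsort(-shares)
    return order, shares

# Toy usage: three heads, four patches, the last two patches contain text.
head_attributions = np.array([
    [0.0, 0.0, 1.0, 1.0],   # attends only to text patches
    [1.0, 1.0, 0.0, 0.0],   # attends only to object patches
    [1.0, 0.0, 1.0, 0.0],   # mixed
])
text_mask = np.array([False, False, True, True])
order, shares = rank_heads_by_text_focus(head_attributions, text_mask)
```

Heads at the front of `order` would be the candidates for intervention under this proxy; a real pipeline would aggregate such scores over many images and samples before selecting anything.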

If this is right

Interventions on the mined circuits raise object-classification accuracy under typographic attack without any model retraining.
The same interventions outperform both supervised and other training-free defense methods.
Applying the intervention to vision encoders of multiple large vision-language models produces clear accuracy gains on Visual Question Answering under typographic interference.
The mechanistic localization is reusable across different CLIP-based architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The attacks appear to be carried by a small, localized subset of heads rather than being diffusely distributed across the entire transformer.
Similar attribution-plus-intervention pipelines could be tested on other documented failure modes such as hallucination or spatial reasoning errors in the same models.
Because the method requires no gradient updates, it could be applied at inference time on deployed systems where retraining is costly or impossible.
The approach suggests that interpretability techniques originally developed for language models can be transferred to isolate and patch concrete visual-semantic conflicts.

Load-bearing premise

The sampling procedure and subsequent probabilistic analysis correctly identify the specific attention heads whose lexical encoding is the mechanistic cause of typographic attacks.

What would settle it

A controlled test in which the identified heads are ablated or re-weighted exactly as proposed yet robustness on typographic-attack images shows no improvement or worsens relative to an untouched baseline.
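One way such a controlled comparison could be scored is sketched below: a paired bootstrap over per-image correctness for the untouched baseline and the intervened model. The function name, the 95% percentile interval, and the default of 2000 resamples are assumptions chosen for illustration; the paper's own evaluation protocol is not described on this page.

```python
import numpy as np

def paired_bootstrap_delta(baseline_correct, intervened_correct, n_boot=2000, seed=0):
    """Accuracy gain of the intervened model with a percentile bootstrap interval.

    Both inputs are per-image 0/1 correctness vectors on the SAME attacked images,
    so each resample draws image indices once and applies them to both models.
    """
    b = np.asarray(baseline_correct, dtype=float)
    i = np.asarray(intervened_correct, dtype=float)
    if b.shape != i.shape:
        raise ValueError("inputs must be paired, one entry per image")
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(b), size=(n_boot, len(b)))
    deltas = i[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(i.mean() - b.mean()), float(lo), float(hi)

# Toy usage with made-up correctness vectors.
delta, lo, hi = paired_bootstrap_delta([0, 0, 1, 0], [1, 1, 1, 1])
```

Under this scoring, an interval that sits at or below zero would be the outcome that counts against the claim, while an interval clearly above zero would support it.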

Figures

Figures reproduced from arXiv: 2607.02494 by Aidong Zhang, Bohan Liu, Guangzhi Xiong, Sanchit Sinha, Wenqian Ye, Zhenghao He.

**Figure 1.** An overview of the proposed method. (a) shows the sampling-based mechanistic interpretability (the Stochastic Lottery) for lexical circuit mining. Latent samples (denoted by purple dashed arrows) on the hypothesized concept basis reveal distinct attribution patterns. (b) illustrates the overall sampling and circuit-mining process using gradient-based attribution. (c) shows the mechanistic intervention proce…

**Figure 2.** Attribution Map of Aligned and Unaligned Concept Vectors. Left: …

**Figure 3.** Concept Localization. Attribution map of selected random concept vectors with high text focus indicated by a high nTAS (Left), and with low text focus indicated by a low nTAS (Right). High attribution-score patches are indicated by a red gradient.

5.3 Influence of the Number of Samples. As shown in …

**Figure 4.** Impact of the number of samples (parametrized by expansion ratio) on method stability, measured as the standard deviation of the module score across random seeds. Our method stabilizes at 8 samples per dimension across ViT sizes and continues to stabilize as the number of samples increases.

… to LVLM demonstrates the ability to improve the VQA accuracy of LVLMs under lexical distraction. Finally, the qualitative…
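The stability measure in Figure 4 (standard deviation of a score across random seeds, as a function of sample count) can be sketched as follows. The `toy_module_score` function is a hypothetical stand-in that averages noisy draws; it is not the paper's module score, and the sample counts and seed range are arbitrary choices for the example.

```python
import numpy as np

def toy_module_score(n_samples, rng):
    # Stand-in for a sampling-based score: the mean of noisy per-sample values.
    return float(rng.normal(loc=0.5, scale=1.0, size=n_samples).mean())

def score_std_across_seeds(score_fn, n_samples, seeds):
    # Re-estimate the score once per seed and report the spread of the estimates.
    scores = [score_fn(n_samples, np.random.default_rng(s)) for s in seeds]
    return float(np.std(scores))

seeds = range(20)
stability = {
    n: score_std_across_seeds(toy_module_score, n, seeds)
    for n in (4, 64, 1024)
}
```

For an average of independent draws the spread shrinks roughly with the square root of the sample count, which is the kind of flattening curve a stability plot over sample budgets would be expected to show.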

**Figure 5.** Example of text-augmented images. Texts with various colors and fonts occur at each border with equal likelihood.

Evaluation Datasets. We evaluate the defense methods on the following datasets: RTA-100 [1] consists of 100 categories and a 1000-item mixture of synthetic and real-world typographic attack images. The images are drawn from 10 regular image datasets, 2 typographic attack datasets, and additio…

**Figure 6.** Prompt for Typographic Image Editing. text_type refers to one of the seven text types we defined. text_word is chosen randomly from the class label set, excluding the ground truth label of the input image. gt_text refers to the ground truth label of the input image.

A.2 Derivation of Psuccess. Bounding Interference via Concentration of Measure. We establish a high-probability bound on the maximum polysemant…

**Figure 7.** Example data from IN-100-Text.

Derivation of the Single Success Probability (p). Next, we derive the probability p that a single random probe u resolves the weakest on-concept patch $k = \arg\min_{i \in \text{on}} \alpha_i$ above the background interference. Since every on-concept patch carries the concept at least as strongly as $\alpha_k$, resolving patch k resolves all on-concept patches simultaneously. The alignment of the probe wit…

**Figure 8.** Mean nTAS of Attention Heads across ViT layers.

**Figure 9.** Classification Accuracy on Typographic-Attack Datasets and …

read the original abstract

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a training-free way to locate lexical heads in ViT encoders via sampling and circuit mining, then intervene on them to cut typographic attack effects in CLIP-based models.


The main takeaway is that simple, direct edits to a few attention heads identified through sampling-based attribution and probabilistic circuit mining can lift robustness to typographic attacks without any retraining. They test this on object classification and then on VQA for several LVLMs using RIO-Bench, claiming gains over both supervised and training-free baselines.

What is new is the specific pipeline: sampling hidden states to attribute semantic versus lexical focus per head, followed by mining to isolate the lexical ones and applying targeted interventions like attention weight adjustment. Releasing the code at the GitHub link is a clear plus for anyone who wants to check the claims.

The central assumption, that the attribution step reliably finds the heads driving the attack, stands or falls with the experiments, and the abstract leaves the exact baseline setups and statistical controls unclear. That makes the outperformance numbers hard to assess from the text alone, though the released code gives a direct path to verification. No obvious internal contradictions show up in the described method.

This work is aimed at people working on multimodal robustness and mechanistic interpretability of vision encoders. It is worth sending to peer review because the mechanistic framing plus code release makes the claims testable, even if the experimental details will need close checking.

Referee Report

0 major / 3 minor

Summary. The paper claims that a training-free mechanistic interpretability pipeline—sampling-based attribution of hidden states, probabilistic analysis to quantify lexical vs. semantic focus per attention head, and circuit mining—identifies specific ViT components in CLIP encoders as the source of typographic attacks; simple interventions on those circuits (e.g., selective attention-weight adjustment) then yield substantial robustness gains in object classification that outperform both supervised and training-free baselines, with the same interventions generalizing to improve VQA accuracy under TA on RIO-Bench for multiple LVLMs.

Significance. If the attribution correctly isolates lexical-encoding heads and the interventions prove effective without retraining, the work supplies both a mechanistic account of a known CLIP failure mode and a practical, interpretable defense that could be applied to safety-critical vision-language systems; the released code and explicit intervention descriptions strengthen verifiability.

minor comments (3)

Abstract and §3: the description of the probabilistic analysis that attributes 'lexical information' to heads would benefit from an explicit formula or pseudocode for the attribution score, as the current prose leaves the exact aggregation over samples unclear.
§4.2 and Table 2: the claim that interventions 'outperform both supervised and training-free defense methods' should include the precise baseline implementations and hyper-parameters used for comparison to allow direct replication.
Figure 3 caption: axis labels and the meaning of the color scale for head-wise lexical scores are not defined in the caption or surrounding text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were raised in the report, so we have no point-by-point responses to provide. We will address any minor issues identified during the revision process.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a sampling-based attribution method to identify ViT attention heads encoding lexical information via probabilistic analysis and circuit mining, followed by direct training-free interventions such as attention weight adjustment. No equations or steps reduce the reported robustness gains to quantities fitted on the same attack data by construction, nor does the central claim rely on self-citation chains or imported uniqueness theorems. The method is presented as mechanistic analysis with released code for external verification, making the derivation self-contained against the described inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard probabilistic sampling assumptions and the existence of identifiable 'circuits' in ViT attention.

pith-pipeline@v0.9.1-grok · 5818 in / 1110 out tokens · 19915 ms · 2026-07-03T14:39:46.958732+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 17 canonical work pages · 5 internal anchors

[1] Azuma, H., Matsui, Y.: Defense-prefix for preventing typographic attacks on CLIP. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3644–3653 (2023)
[2] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
[3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
[4] Cao, Y., Xing, Y., Zhang, J., Lin, D., Zhang, T., Tsang, I., Liu, Y., Guo, Q.: Scenetap: Scene-coherent typographic adversarial planner against vision-language models in real-world environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 25050–25059 (June 2025)
[5] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24185–24198. IEEE (2024)
[6] Cheng, H., Xiao, E., Gu, J., Yang, L., Duan, J., Zhang, J., Cao, J., Xu, K., Xu, R.: Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIX. p. 179–196. Springer-Verlag, Berlin, ... doi:10.1007/978-3-031-73202-7_11
[7] Chung, N., Gao, S., Vu, T.A., Zhang, J., Liu, A., Lin, Y., Dong, J.S., Guo, Q.: Towards transferable attacks against vision-LLMs in autonomous driving with typography. arXiv preprint arXiv:2405.14169 (2024)
[8] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=2dnO3LLiJ1
[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
[10] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=YicbFdNTTy
[11] Dreyer, M., Hufe, L., Berend, J., Wiegand, T., Lapuschkin, S., Samek, W.: From what to how: Attributing CLIP’s latent components reveals unexpected semantic reliance. arXiv preprint arXiv:2505.20229 (2025)
[12] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., Olah, C.: Toy models of superposition. Transformer Circuits Thread (2022), https://transformer-circuits.pub/2022/toy_model/index.html
[13] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., Olah, C.: A mathematical framework for transformer circ... Transformer Circuits Thread (2021), https://transformer-circuits.pub/2021/framework/index.html
[14] Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=rJl-b3RcF7
[15] Gandelsman, Y., Efros, A.A., Steinhardt, J.: Interpreting CLIP’s image representation via text-based decomposition. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=5Ca9sSzuDp
[16] Gandelsman, Y., Efros, A.A., Steinhardt, J.: Interpreting the second-order effects of neurons in CLIP. arXiv preprint arXiv:2406.04341 (2024), https://arxiv.org/abs/2406.04341
[17] Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., Grill, J.B., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Coleman,... Gemma 3 Technical Report. arXiv (2025)
[18] Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., Olah, C.: Multimodal neurons in artificial neural networks. Distill 6(3), e30 (2021)
[19] Huben, R., Cunningham, H., Smith, L.R., Ewart, A., Sharkey, L.: Sparse autoencoders find highly interpretable features in language models. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=F76bwRSLeK
[20] Hufe, L., Venhoff, C., Dreyer, M., Purelku, E., Lapuschkin, S., Samek, W.: Dyslexify: A mechanistic defense against typographic attacks in CLIP. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=UI7mbsIZeN
[21] Ilharco, G., Wortsman, M., Gadre, S.Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., Schmidt, L.: Patching open-vocabulary models by interpolating weights. Advances in Neural Information Processing Systems 35, 29262–29277 (2022)
[22] Kissane, C., Krzyzanowski, R., Bloom, J.I., Conmy, A., Nanda, N.: Interpreting attention layer outputs with sparse autoencoders. arXiv preprint arXiv:2406.17759 (2024), https://arxiv.org/abs/2406.17759
[23] Li, C., Wang, S., Zhang, Y., Zhang, J., Zong, C.: Interpreting and exploiting functional specialization in multi-head attention under multi-task learning. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 16460–16476. Association for Computational Linguistics, Singapore... doi:10.18653/v1/2023.emnlp-main.1026
[24] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
[25] Materzyńska, J., Torralba, A., Bau, D.: Disentangling visual and written concepts in CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16410–16419 (June 2022)
[26] Park, K., Choe, Y.J., Veitch, V.: The linear representation hypothesis and the geometry of large language models. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24, JMLR.org (2024)
[27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Res... (2021)
[28] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)
[29] Shekhar, A.: ImageNet100. Kaggle (2021), https://www.kaggle.com/datasets/ambityga/imagenet100, accessed: 2026-01-16
[30] Su, J., Kempe, J., Ullrich, K.: From concepts to components: Concept-agnostic attention module discovery in transformers. arXiv preprint arXiv:2506.17052 (2025)
[31] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI. p. 776–794. Springer-Verlag, Berlin, Heidelberg (2020). https://doi.org/10.1007/978-3-030-58621-8_45, https://doi....
[32] Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9568–9578 (2024)
[33] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceedings.neurips....
[34] Wang, G., Bai, L., Nah, W.J., Wang, J., Zhang, Z., Chen, Z., Wu, J., Islam, M., Liu, H., Ren, H.: Surgical-LVLM: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. In: ICLR 2025 Workshop on Foundation Models in the Wild (2025), https://openreview.net/forum?id=U4z69U9m9t
[35] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
[36] Wang, X., Zhao, Z., Larson, M.: Typographic attacks in a multi-image setting. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 12594–12604. Association for Computational Lingu...
[37] Waseda, F., Yamabe, S., Shiono, D., Sasaki, K., Takahashi, T.: Read or ignore? A unified benchmark for typographic-attack robustness and text recognition in vision-language models. CoRR abs/2512.11899 (2025). https://doi.org/10.48550/arXiv.2512.11899
[38] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
[39] Yang, Z., Jia, X., Li, H., Yan, J.: LLM4Drive: A survey of large language models for autonomous driving. arXiv preprint arXiv:2311.01043 (2023)
[40] Zaigrajew, V., Baniecki, H., Biecek, P.: Interpreting CLIP with hierarchical sparse autoencoders. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=5MQQsenQBm

A Appendix. A.1 Dataset Details. Circuit Mining Dataset. Our training-free concept mining process takes a text-injected image dataset...

[1] [1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Azuma, H., Matsui, Y.: Defense-prefix for preventing typographic attacks on CLIP. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3644–3653 (2023)

2023

[2] [2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Cao, Y., Xing, Y., Zhang, J., Lin, D., Zhang, T., Tsang, I., Liu, Y., Guo, Q.: Scenetap: Scene-coherent typographic adversarial planner against vision-language models in real-world environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 25050–25059 (June 2025)

2025

[5] [5]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24185–24198. IEEE (2024)

2024

[6] [6]

In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIX

Cheng, H., Xiao, E., Gu, J., Yang, L., Duan, J., Zhang, J., Cao, J., Xu, K., Xu, R.: Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language models. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LIX. p. 179–196. Springer-Verlag, Berlin, ...

work page doi:10.1007/978-3-031-73202-7_11 2024

[7] [7]

arXiv preprint arXiv:2405.14169 (2024)

Chung, N., Gao, S., Vu, T.A., Zhang, J., Liu, A., Lin, Y., Dong, J.S., Guo, Q.: Towards transferable attacks against vision-llms in autonomous driving with ty- pography. arXiv preprint arXiv:2405.14169 (2024)

work page arXiv 2024

[8] [8]

In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=2dnO3LLiJ1

Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need regis- ters. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=2dnO3LLiJ1

2024

[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848

[10] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=YicbFdNTTy

[11] Dreyer, M., Hufe, L., Berend, J., Wiegand, T., Lapuschkin, S., Samek, W.: From what to how: Attributing CLIP’s latent components reveals unexpected semantic reliance. arXiv preprint arXiv:2505.20229 (2025)

[12] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., Olah, C.: Toy models of superposition. Transformer Circuits Thread (2022), https://transformer-circuits.pub/2022/toy_model/index.html

[13] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., Olah, C.: A mathematical framework for transformer circ... Transformer Circuits Thread (2021), https://transformer-circuits.pub/2021/framework/index.html

[14] Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=rJl-b3RcF7

[15] Gandelsman, Y., Efros, A.A., Steinhardt, J.: Interpreting CLIP’s image representation via text-based decomposition. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=5Ca9sSzuDp

[16] Gandelsman, Y., Efros, A.A., Steinhardt, J.: Interpreting the second-order effects of neurons in CLIP. arXiv preprint arXiv:2406.04341 (2024), https://arxiv.org/abs/2406.04341

[17] Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., Grill, J.B., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Coleman,... Gemma 3 Technical Report. arXiv (2025)

[18] Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., Olah, C.: Multimodal neurons in artificial neural networks. Distill 6(3), e30 (2021)

[19] Huben, R., Cunningham, H., Smith, L.R., Ewart, A., Sharkey, L.: Sparse autoencoders find highly interpretable features in language models. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=F76bwRSLeK

[20] Hufe, L., Venhoff, C., Dreyer, M., Purelku, E., Lapuschkin, S., Samek, W.: Dyslexify: A mechanistic defense against typographic attacks in CLIP. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=UI7mbsIZeN

[21] Ilharco, G., Wortsman, M., Gadre, S.Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., Schmidt, L.: Patching open-vocabulary models by interpolating weights. Advances in Neural Information Processing Systems 35, 29262–29277 (2022)

[22] Kissane, C., Krzyzanowski, R., Bloom, J.I., Conmy, A., Nanda, N.: Interpreting attention layer outputs with sparse autoencoders. arXiv preprint arXiv:2406.17759 (2024), https://arxiv.org/abs/2406.17759

[23] Li, C., Wang, S., Zhang, Y., Zhang, J., Zong, C.: Interpreting and exploiting functional specialization in multi-head attention under multi-task learning. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 16460–16476. Association for Computational Linguistics, Singapore... doi:10.18653/v1/2023.emnlp-main.1026

[24] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

[25] Materzyńska, J., Torralba, A., Bau, D.: Disentangling visual and written concepts in CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16410–16419 (June 2022)

[26] Park, K., Choe, Y.J., Veitch, V.: The linear representation hypothesis and the geometry of large language models. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24, JMLR.org (2024)

[27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Res... (2021)

[28] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

[29] Shekhar, A.: ImageNet100. Kaggle (2021), https://www.kaggle.com/datasets/ambityga/imagenet100, accessed: 2026-01-16

[30] Su, J., Kempe, J., Ullrich, K.: From concepts to components: Concept-agnostic attention module discovery in transformers. arXiv preprint arXiv:2506.17052 (2025)

[31] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI. pp. 776–794. Springer-Verlag, Berlin, Heidelberg (2020). https://doi.org/10.1007/978-3-030-58621-8_45, https://doi....

[32] Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9568–9578 (2024)

[33] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017), https://proceedings.neurips....

[34] Wang, G., Bai, L., Nah, W.J., Wang, J., Zhang, Z., Chen, Z., Wu, J., Islam, M., Liu, H., Ren, H.: Surgical-LVLM: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. In: ICLR 2025 Workshop on Foundation Models in the Wild (2025), https://openreview.net/forum?id=U4z69U9m9t

[35] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

[36] Wang, X., Zhao, Z., Larson, M.: Typographic attacks in a multi-image setting. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 12594–12604. Association for Computational Lingu...

[37] Waseda, F., Yamabe, S., Shiono, D., Sasaki, K., Takahashi, T.: Read or ignore? A unified benchmark for typographic-attack robustness and text recognition in vision-language models. CoRR abs/2512.11899 (2025). https://doi.org/10.48550/arXiv.2512.11899

[38] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)

[39] Yang, Z., Jia, X., Li, H., Yan, J.: LLM4Drive: A survey of large language models for autonomous driving. arXiv preprint arXiv:2311.01043 (2023)

[40] Zaigrajew, V., Baniecki, H., Biecek, P.: Interpreting CLIP with hierarchical sparse autoencoders. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=5MQQsenQBm

A Appendix

A.1 Dataset Details

Circuit Mining Dataset. Our training-free concept mining process takes a text-injected image dataset...