MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Alexandros Stergiou; Animesh Jain

arxiv: 2508.07833 · v3 · submitted 2025-08-11 · 💻 cs.CV

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain , Alexandros Stergiou This is my paper

Pith reviewed 2026-05-18 23:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords model inversionvision-language modelsinterpretabilitymultimodal inversionfeature alignmentregularizationconcept visualization

0 comments

The pith

MIMIC inverts internal encodings of vision-language models to recover visual concepts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models encode multimodal inputs in complex internal states that resist direct inspection. The paper introduces the MIMIC framework to invert those states back into images that represent the encoded visual concepts. It does so with joint inversion that respects the model's autoregressive processing and adds regularizers to enforce spatial consistency, image naturalness, and semantic fidelity. A reader would care because this inversion supplies a concrete way to see what the model has internalized, rather than relying on indirect text explanations. If the method holds, it supplies both visual examples and quantitative scores that link model internals to human-interpretable outputs across free-form responses of different lengths.

Core claim

MIMIC performs the first model inversion for visual interpretations of VLM concepts by combining joint VLM-based inversion, feature alignment to match autoregressive behavior, and a triplet of regularizers that enforce spatial alignment, natural image smoothness, and semantic realism, then validates the outputs with both standard visual metrics and semantic text-based metrics.

What carries the argument

Joint VLM-based inversion with feature alignment plus a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism.

If this is right

Visual inspection of VLM concepts becomes feasible for outputs of varying length.
Quantitative comparison of model concepts against human judgments is now possible using both image quality and text-semantic scores.
Transparency increases because generated images can be directly compared to what the model claims to have understood.
Debugging of VLM behavior gains a visual channel that reveals mismatches between internal encodings and intended meaning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inversion pipeline could be tested on other multimodal architectures to see whether the regularizer triplet generalizes beyond current VLMs.
If the recovered images prove stable, they could serve as training targets for fine-tuning models toward more human-aligned concepts.
The method opens a route to systematic auditing of large-scale VLMs by turning opaque internal states into a visual audit trail.

Load-bearing premise

The feature alignment and three regularizers together can recover visual concepts from VLM encodings without adding artifacts that change the original semantics.

What would settle it

Run MIMIC on known VLM prompts such as 'a red sports car on a highway' and test whether human raters or separate semantic embedding distances judge the generated images as matching the prompt semantics at rates above random chance.

Figures

Figures reproduced from arXiv: 2508.07833 by Alexandros Stergiou, Animesh Jain.

**Figure 1.** Figure 1: Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) synthesizes adversarial visual prompts optimized on VLM text outputs. The synthesized images represent the dominant visual features associated with semantic concepts. We validate the effectiveness of the synthesized images using a diverse set of evaluation metrics that capture classification accuracy, semantic alignment with textua… view at source ↗

**Figure 2.** Figure 2: MIMIC inversion pipeline and synthesized concepts for goldfish, golden retriever, and corn. We optimize a noisy image using our proposed aggregated objective with an adapted cross-entropy loss LSCE based on a [target] textual concept and a base feature loss Lbase that matches the layer statistics for synthesized images and real images for [target]. Regularizers R are added to promote semantic consistency, … view at source ↗

**Figure 3.** Figure 3: Synthesized concepts with different optimization settings for goldfish, golden retriever, tiger, pretzel, and corn. Each row shows outputs generated using a combination of optimization objectives. Aggregated Objective. Our final combined optimization objective is backpropagated iteratively to update vb over iterative steps, e.g. for i → i + 1 : vbi+1 =min vbi γ1 L SCE (s(Φθϕ ([G(t), Eθe (vb)]))+γ2 L base … view at source ↗

**Figure 5.** Figure 5: Top-1 accuracy (yellow) and CLIPScore (blue) for target [goldfish] across output lengths. IV. RESULTS Model Details We invert LLaVA-1.5 [33] that consists of a CLIP ViT-L/14 [34] vision encoder and a LLaMA-3-8BInstruct [35] language model. Model parameters remain frozen and the updatable input vb ∈ R 3×336×336 initialized from a Gaussian distribution vb ∼ N (0, 1) includes the only updatable parameters. W… view at source ↗

**Figure 6.** Figure 6: Target length |yb| ablations for target concept [goldfish]. Little variance in the synthesized image quality is shown across target lengths. and ROUGE-L scores due to better lexical and structural conformity with the template. As shown in [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

read the original abstract

Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MIMIC adapts inversion techniques to VLMs with feature alignment and regularizers, but the abstract gives no results to show it works.

read the letter

The main takeaway is that this paper introduces MIMIC as the first inversion approach aimed at recovering visual concepts from VLM internal encodings, specifically handling autoregressive outputs through joint inversion plus feature alignment and a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. It targets free-form VLM outputs of different lengths and plans to report both visual quality and semantic text metrics. This is a direct response to the transparency limits in current multimodal models. The framing of the problem is clear and the proposed structure builds logically on existing inversion ideas by adding VLM-specific pieces. The regularizers address plausible concerns like artifacts and semantic drift, which shows some careful thinking about what could go wrong in the optimization. The soft spots are the missing pieces. The abstract contains no numbers, no baseline comparisons, no ablations on the regularizers, and no example outputs, so there is no way to check whether the method actually recovers faithful concepts or just generates plausible images. The central assumption that the combination avoids distorting the original semantics stays untested on the information given. Citation patterns look standard, referencing prior unimodal inversion work without obvious gaps in the high-level positioning. This paper is for researchers working on interpretability and explainability for vision-language models. Someone already thinking about visualization tools for black-box multimodal systems could extract the high-level recipe and try adapting it, but the value hinges on whether the full experiments hold up. I would send it to peer review so referees can examine the implementation details and the actual quantitative and qualitative results.

Referee Report

1 major / 2 minor

Summary. The paper proposes MIMIC, a framework for inverting the internal encodings of Vision-Language Models (VLMs) to enable visual interpretations of concepts. It employs joint VLM-based inversion combined with a feature alignment objective to handle autoregressive processing, plus a triplet of regularizers enforcing spatial alignment, natural image smoothness, and semantic realism. The method is evaluated quantitatively via standard visual quality metrics and semantically via text-based metrics on free-form VLM outputs of varying lengths, with the claim that it is the first model inversion approach targeting visual interpretations of VLM concepts.

Significance. If the quantitative and qualitative results hold under scrutiny, this would represent a meaningful advance in VLM interpretability by providing a practical way to recover visual concepts from internal multimodal encodings. The combination of feature alignment with domain-specific regularizers addresses a clear gap between existing inversion techniques (primarily for unimodal models) and the needs of autoregressive VLMs, potentially aiding transparency and trust in deployed multimodal systems.

major comments (1)

[Evaluation section (quantitative results)] The central claim that the triplet of regularizers plus feature alignment faithfully recovers semantics without distorting artifacts is load-bearing, yet the manuscript provides no ablation isolating the contribution of each regularizer or quantifying artifact introduction (e.g., via semantic drift metrics before/after each term). This leaves open whether the observed visual quality stems from the inversion procedure itself or from the specific regularizer balance.

minor comments (2)

[Abstract] The abstract states that results include 'standard visual quality metrics and semantic text-based metrics' but does not report any concrete values or baselines; adding one or two key numbers (with error bars) would strengthen the summary.
[Method] Notation for the feature alignment objective and the three regularizer terms should be introduced with explicit equations early in the method section to improve readability for readers unfamiliar with VLM inversion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work. We address the single major comment below.

read point-by-point responses

Referee: [Evaluation section (quantitative results)] The central claim that the triplet of regularizers plus feature alignment faithfully recovers semantics without distorting artifacts is load-bearing, yet the manuscript provides no ablation isolating the contribution of each regularizer or quantifying artifact introduction (e.g., via semantic drift metrics before/after each term). This leaves open whether the observed visual quality stems from the inversion procedure itself or from the specific regularizer balance.

Authors: We agree that isolating the contribution of each regularizer would provide stronger evidence for the central claim. Our current results demonstrate that the full combination of joint inversion, feature alignment, and the three regularizers yields superior visual quality and semantic fidelity compared to baselines. However, to directly address the concern about potential artifacts or whether gains arise primarily from the base inversion, we will add targeted ablations in the revised manuscript. These will include variants that disable each regularizer individually (spatial alignment, smoothness, and semantic realism) while keeping the inversion and alignment objectives fixed, and we will report both standard visual metrics and new semantic drift metrics (e.g., CLIP-based concept consistency before/after each term). This will clarify the necessity of the balanced regularizer set for artifact-free recovery. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces MIMIC as a new optimization-based inversion framework that combines joint VLM inversion, feature alignment for autoregressive processing, and three explicit regularizers (spatial alignment, smoothness, semantic realism). No equations, fitted parameters, or self-citations are presented that reduce the inversion outputs or claimed visual interpretations to a re-expression of the inputs by construction. The central procedure is described as a novel combination of standard inversion techniques with domain-specific regularizers, evaluated on independent quantitative and qualitative metrics, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on standard VLM autoregressive assumptions and the unstated premise that inversion is feasible.

pith-pipeline@v0.9.0 · 5663 in / 1022 out tokens · 24922 ms · 2026-05-18T23:43:09.882750+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MIMIC uses a joint VLM-based inversion and a feature alignment objective... triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism... aggregated objective γ1 L_SCE + γ2 L_base + R(bv)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
cs.CV 2025-11 unverdicted novelty 7.0

TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Faithlm: Towards faithful explanations for large language models,

Yu-Neng Chuang, Guanchu Wang, et al., “Faithlm: Towards faithful explanations for large language models,” arxiv:2402.04678, 2024

work page arXiv 2024
[2]

Selfie: Self-interpretation of large language model em- beddings,

Haozhe Chen, Carl V ondrick, and Chengzhi Mao, “Selfie: Self-interpretation of large language model em- beddings,” inICML, 2024

work page 2024
[3]

Patch- scopes: A unifying framework for inspecting hidden representations of language models,

Asma Ghandeharioun, Avi Caciularu, et al., “Patch- scopes: A unifying framework for inspecting hidden representations of language models,” inICML, 2024

work page 2024
[4]

Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296,

Roger Grosse, Juhan Bae, et al., “Studying large lan- guage model generalization with influence functions,” arxiv:2308.03296, 2023

work page arXiv 2023
[5]

Locating and editing factual associations in gpt,

Kevin Meng, David Bau, et al., “Locating and editing factual associations in gpt,” inNeurIPS, 2023

work page 2023
[6]

Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez and Been Kim, “Towards a rigorous science of interpretable machine learning,” arxiv:1702.08608, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[7]

Grad-cam: Visual explanations from deep networks via gradient-based localization,

Ramprasaath R Selvaraju, Michael Cogswell, et al., “Grad-cam: Visual explanations from deep networks via gradient-based localization,” inICCV, 2017

work page 2017
[8]

Ax- iomatic attribution for deep networks,

Mukund Sundararajan, Ankur Taly, and Qiqi Yan, “Ax- iomatic attribution for deep networks,” inICML, 2017

work page 2017
[9]

Do vision transformers see like convolutional neural networks?,

Maithra Raghu, Thomas Unterthiner, et al., “Do vision transformers see like convolutional neural networks?,” in NeurIPS, 2021

work page 2021
[10]

Grad-eclip: Gradient- based visual and textual explanations for clip,

Chenyang Zhao, Kun Wang, et al., “Grad-eclip: Gradient- based visual and textual explanations for clip,” inICML, 2024

work page 2024
[11]

What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evalu- ation,

Michal Golovanevsky, William Rudman, et al., “What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evalu- ation,” inACL, 2025

work page 2025
[12]

Learning deep features for discriminative localization,

Bolei Zhou, Aditya Khosla, et al., “Learning deep features for discriminative localization,” inCVPR, 2016

work page 2016
[13]

Score-cam: Score- weighted visual explanations for convolutional neural networks,

Haofan Wang, Zifan Wang, et al., “Score-cam: Score- weighted visual explanations for convolutional neural networks,” inCVPRw, 2020

work page 2020
[14]

On pixel-wise explanations for non-linear classifier decisions by layer- wise relevance propagation,

Sebastian Bach, Alexander Binder, et al., “On pixel-wise explanations for non-linear classifier decisions by layer- wise relevance propagation,”PLOS ONE, 2015

work page 2015
[15]

Learning important features through propagating activation differences,

Avanti Shrikumar, Peyton Greenside, and Anshul Kun- daje, “Learning important features through propagating activation differences,” inICML, 2017

work page 2017
[16]

Dreaming to distill: Data-free knowledge transfer via deepinversion,

Hongxu Yin, Pavlo Molchanov, et al., “Dreaming to distill: Data-free knowledge transfer via deepinversion,” inCVPR, 2020

work page 2020
[17]

Gradvit: Gradient inversion of vision transformers,

Ali Hatamizadeh, Hongxu Yin, et al., “Gradvit: Gradient inversion of vision transformers,” inCVPR, 2022

work page 2022
[18]

Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions,

Amin Ghiasi, Hamid Kazemi, et al., “Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions,” inICML, 2022

work page 2022
[19]

The mind’s eye: Visualizing class- agnostic features of cnns,

Alexandros Stergiou, “The mind’s eye: Visualizing class- agnostic features of cnns,” inICIP, 2021

work page 2021
[20]

Understanding neural networks through deep visualization,

Jason Yosinski, Jeff Clune, et al., “Understanding neural networks through deep visualization,” inICMLw, 2015

work page 2015
[21]

Plug & play generative networks: Conditional iterative generation of images in latent space,

Anh Nguyen, Jeff Clune, et al., “Plug & play generative networks: Conditional iterative generation of images in latent space,” inCVPR, 2017

work page 2017
[22]

Imagenet: A large-scale hierarchical image database,

Jia Deng, Wei Dong, et al., “Imagenet: A large-scale hierarchical image database,” inCVPR, 2009

work page 2009
[23]

Craft: Concept recursive activation factorization for explainability,

Thomas Fel, Agustin Picard, et al., “Craft: Concept recursive activation factorization for explainability,” in CVPR, 2023

work page 2023
[24]

Universal sparse autoencoders: Interpretable cross-model concept alignment,

Harrish Thasarathan, Julian Forsyth, et al., “Universal sparse autoencoders: Interpretable cross-model concept alignment,” inICML, 2025

work page 2025
[25]

Lora: Low-rank adaptation of large language models.,

Edward J Hu, Yelong Shen, et al., “Lora: Low-rank adaptation of large language models.,”ICLR, 2022

work page 2022
[26]

Interpreting the second-order effects of neurons in clip,

Yossi Gandelsman, Alexei A. Efros, and Jacob Stein- hardt, “Interpreting the second-order effects of neurons in clip,” inICLR, 2025

work page 2025
[27]

Probing multimodal large language models for global and local semantic representations,

Mingxu Tao, Quzhe Huang, et al., “Probing multimodal large language models for global and local semantic representations,” inLREC-COLING, 2024

work page 2024
[28]

Linear expla- nations for individual neurons,

Tuomas Oikarinen and Tsui-Wei Weng, “Linear expla- nations for individual neurons,” inICML, 2024

work page 2024
[29]

From colors to classes: Emergence of concepts in vision transform- ers,

Teresa Dorszewski, Lenka T ˇetkov´a, et al., “From colors to classes: Emergence of concepts in vision transform- ers,”arxiv:2503.24071, 2025

work page arXiv 2025
[30]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, et al., “Deep residual learning for image recognition,” inCVPR, 2016

work page 2016
[31]

Rethink- ing the inception architecture for computer vision,

Christian Szegedy, Vincent Vanhoucke, et al., “Rethink- ing the inception architecture for computer vision,” in CVPR, 2016

work page 2016
[32]

Mobilenetv2: Inverted residuals and linear bottlenecks,

Mark Sandler, Andrew Howard, et al., “Mobilenetv2: Inverted residuals and linear bottlenecks,” inCVPR, 2018

work page 2018
[33]

Improved baselines with visual instruction tuning,

Haotian Liu, Chunyuan Li, et al., “Improved baselines with visual instruction tuning,” inCVPR, 2024

work page 2024
[34]

Learning transfer- able visual models from natural language supervision,

Alec Radford, Jong Wook Kim, et al., “Learning transfer- able visual models from natural language supervision,” inICML, 2021

work page 2021
[35]

Llama 3 model card,

AI@Meta, “Llama 3 model card,” 2024

work page 2024
[36]

The unreasonable effectiveness of deep features as a perceptual metric,

Richard Zhang, Phillip Isola, et al., “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018

work page 2018
[37]

Clipscore: A reference-free evaluation metric for image captioning,

Jack Hessel, Ari Holtzman, et al., “Clipscore: A reference-free evaluation metric for image captioning,” inEMNLPS, 2022

work page 2022
[38]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”NeurIPS, 2017

work page 2017
[39]

Improved tech- niques for training gans,

Tim Salimans, Ian Goodfellow, et al., “Improved tech- niques for training gans,”NeurIPS, 2016

work page 2016
[40]

Effectively unbiased fid inception score and where to find them,

Min Chong and David Forsyth, “Effectively unbiased fid inception score and where to find them,” inCVPR, 2020

work page 2020
[41]

Bleu: a method for automatic evaluation of machine translation,

Kishore Papineni, Salim Roukos, et al., “Bleu: a method for automatic evaluation of machine translation,” inACL, 2002

work page 2002
[42]

Meteor: An auto- matic metric for mt evaluation with improved correlation with human judgments,

Satanjeev Banerjee and Alon Lavie, “Meteor: An auto- matic metric for mt evaluation with improved correlation with human judgments,” inACLw, 2005

work page 2005
[43]

Rouge: A package for automatic evalu- ation of summaries,

Lin Chin-Yew, “Rouge: A package for automatic evalu- ation of summaries,” inTSBOw, 2004. APPENDIXA VIT INVERSION WITHMIMIC In addition to VLMs, we applyMIMICon Vision Trans- formers (ViTs) used on image classification objectives. For these experiments, we initialize an updatable input bvas in Sec. III. The synthesized image is then passed through the froz...

work page 2004
[44]

All other hyperparameters and training settings remain unchanged

Implementation Details:We follow the same experi- mental configuration as in the main inversion setup, with two modifications: the optimization is performed for3000 iterations, and the scaling factorsα 1, α2, α3, β2, γ1,andγ 2 are set to1.0whileβ 1 is set to1×10 −4. All other hyperparameters and training settings remain unchanged

work page
[45]

Results:As shown in Fig. B-1, replacing theℓ 2-based base feature loss with a KL-divergence formulation produces synthesized images that are less noisy and capture more distinctive semantic details, such as fine textures and clearer object boundaries (e.g., the fins of the goldfish, the body structure of the retriever, and the braided form of the pretzel)...

work page
[46]

Templates used are shown in Fig

V arying target length:We evaluate the impact of [target]length using fixed templates with|ˆy| ∈ {4,5,7}. Templates used are shown in Fig. B-2 and the corresponding quantitative and qualitative results are shown in Table II and Figs. 5 and 6. Despite variation in length, the reconstructed images remain visually consistent, and the predicted outputs preser...

work page
[47]

V arying target description:We further examine the effect of using more natural, free-form descriptions from the templates shown in Fig. B-3. Table B-II comparesdescriptions D1 (Fig. B-4a) andD 2 (Fig. B-4b) with differing lexical structures. We observe that text similarity metrics such as BLEU and ROUGE-L are slightly lower than those observed in the fix...

work page
[48]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[49]

Cats and dogs,

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar, “Cats and dogs,” in IEEE Xplore, 2012

work page 2012

[1] [1]

Faithlm: Towards faithful explanations for large language models,

Yu-Neng Chuang, Guanchu Wang, et al., “Faithlm: Towards faithful explanations for large language models,” arxiv:2402.04678, 2024

work page arXiv 2024

[2] [2]

Selfie: Self-interpretation of large language model em- beddings,

Haozhe Chen, Carl V ondrick, and Chengzhi Mao, “Selfie: Self-interpretation of large language model em- beddings,” inICML, 2024

work page 2024

[3] [3]

Patch- scopes: A unifying framework for inspecting hidden representations of language models,

Asma Ghandeharioun, Avi Caciularu, et al., “Patch- scopes: A unifying framework for inspecting hidden representations of language models,” inICML, 2024

work page 2024

[4] [4]

Studying large language model generalization with influence functions.arXiv preprint arXiv:2308.03296,

Roger Grosse, Juhan Bae, et al., “Studying large lan- guage model generalization with influence functions,” arxiv:2308.03296, 2023

work page arXiv 2023

[5] [5]

Locating and editing factual associations in gpt,

Kevin Meng, David Bau, et al., “Locating and editing factual associations in gpt,” inNeurIPS, 2023

work page 2023

[6] [6]

Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez and Been Kim, “Towards a rigorous science of interpretable machine learning,” arxiv:1702.08608, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

Grad-cam: Visual explanations from deep networks via gradient-based localization,

Ramprasaath R Selvaraju, Michael Cogswell, et al., “Grad-cam: Visual explanations from deep networks via gradient-based localization,” inICCV, 2017

work page 2017

[8] [8]

Ax- iomatic attribution for deep networks,

Mukund Sundararajan, Ankur Taly, and Qiqi Yan, “Ax- iomatic attribution for deep networks,” inICML, 2017

work page 2017

[9] [9]

Do vision transformers see like convolutional neural networks?,

Maithra Raghu, Thomas Unterthiner, et al., “Do vision transformers see like convolutional neural networks?,” in NeurIPS, 2021

work page 2021

[10] [10]

Grad-eclip: Gradient- based visual and textual explanations for clip,

Chenyang Zhao, Kun Wang, et al., “Grad-eclip: Gradient- based visual and textual explanations for clip,” inICML, 2024

work page 2024

[11] [11]

What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evalu- ation,

Michal Golovanevsky, William Rudman, et al., “What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evalu- ation,” inACL, 2025

work page 2025

[12] [12]

Learning deep features for discriminative localization,

Bolei Zhou, Aditya Khosla, et al., “Learning deep features for discriminative localization,” inCVPR, 2016

work page 2016

[13] [13]

Score-cam: Score- weighted visual explanations for convolutional neural networks,

Haofan Wang, Zifan Wang, et al., “Score-cam: Score- weighted visual explanations for convolutional neural networks,” inCVPRw, 2020

work page 2020

[14] [14]

On pixel-wise explanations for non-linear classifier decisions by layer- wise relevance propagation,

Sebastian Bach, Alexander Binder, et al., “On pixel-wise explanations for non-linear classifier decisions by layer- wise relevance propagation,”PLOS ONE, 2015

work page 2015

[15] [15]

Learning important features through propagating activation differences,

Avanti Shrikumar, Peyton Greenside, and Anshul Kun- daje, “Learning important features through propagating activation differences,” inICML, 2017

work page 2017

[16] [16]

Dreaming to distill: Data-free knowledge transfer via deepinversion,

Hongxu Yin, Pavlo Molchanov, et al., “Dreaming to distill: Data-free knowledge transfer via deepinversion,” inCVPR, 2020

work page 2020

[17] [17]

Gradvit: Gradient inversion of vision transformers,

Ali Hatamizadeh, Hongxu Yin, et al., “Gradvit: Gradient inversion of vision transformers,” inCVPR, 2022

work page 2022

[18] [18]

Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions,

Amin Ghiasi, Hamid Kazemi, et al., “Plug-in inversion: Model-agnostic inversion for vision with data augmenta- tions,” inICML, 2022

work page 2022

[19] [19]

The mind’s eye: Visualizing class- agnostic features of cnns,

Alexandros Stergiou, “The mind’s eye: Visualizing class- agnostic features of cnns,” inICIP, 2021

work page 2021

[20] [20]

Understanding neural networks through deep visualization,

Jason Yosinski, Jeff Clune, et al., “Understanding neural networks through deep visualization,” inICMLw, 2015

work page 2015

[21] [21]

Plug & play generative networks: Conditional iterative generation of images in latent space,

Anh Nguyen, Jeff Clune, et al., “Plug & play generative networks: Conditional iterative generation of images in latent space,” inCVPR, 2017

work page 2017

[22] [22]

Imagenet: A large-scale hierarchical image database,

Jia Deng, Wei Dong, et al., “Imagenet: A large-scale hierarchical image database,” inCVPR, 2009

work page 2009

[23] [23]

Craft: Concept recursive activation factorization for explainability,

Thomas Fel, Agustin Picard, et al., “Craft: Concept recursive activation factorization for explainability,” in CVPR, 2023

work page 2023

[24] [24]

Universal sparse autoencoders: Interpretable cross-model concept alignment,

Harrish Thasarathan, Julian Forsyth, et al., “Universal sparse autoencoders: Interpretable cross-model concept alignment,” inICML, 2025

work page 2025

[25] [25]

Lora: Low-rank adaptation of large language models.,

Edward J Hu, Yelong Shen, et al., “Lora: Low-rank adaptation of large language models.,”ICLR, 2022

work page 2022

[26] [26]

Interpreting the second-order effects of neurons in clip,

Yossi Gandelsman, Alexei A. Efros, and Jacob Stein- hardt, “Interpreting the second-order effects of neurons in clip,” inICLR, 2025

work page 2025

[27] [27]

Probing multimodal large language models for global and local semantic representations,

Mingxu Tao, Quzhe Huang, et al., “Probing multimodal large language models for global and local semantic representations,” inLREC-COLING, 2024

work page 2024

[28] [28]

Linear expla- nations for individual neurons,

Tuomas Oikarinen and Tsui-Wei Weng, “Linear expla- nations for individual neurons,” inICML, 2024

work page 2024

[29] [29]

From colors to classes: Emergence of concepts in vision transform- ers,

Teresa Dorszewski, Lenka T ˇetkov´a, et al., “From colors to classes: Emergence of concepts in vision transform- ers,”arxiv:2503.24071, 2025

work page arXiv 2025

[30] [30]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, et al., “Deep residual learning for image recognition,” inCVPR, 2016

work page 2016

[31] [31]

Rethink- ing the inception architecture for computer vision,

Christian Szegedy, Vincent Vanhoucke, et al., “Rethink- ing the inception architecture for computer vision,” in CVPR, 2016

work page 2016

[32] [32]

Mobilenetv2: Inverted residuals and linear bottlenecks,

Mark Sandler, Andrew Howard, et al., “Mobilenetv2: Inverted residuals and linear bottlenecks,” inCVPR, 2018

work page 2018

[33] [33]

Improved baselines with visual instruction tuning,

Haotian Liu, Chunyuan Li, et al., “Improved baselines with visual instruction tuning,” inCVPR, 2024

work page 2024

[34] [34]

Learning transfer- able visual models from natural language supervision,

Alec Radford, Jong Wook Kim, et al., “Learning transfer- able visual models from natural language supervision,” inICML, 2021

work page 2021

[35] [35]

Llama 3 model card,

AI@Meta, “Llama 3 model card,” 2024

work page 2024

[36] [36]

The unreasonable effectiveness of deep features as a perceptual metric,

Richard Zhang, Phillip Isola, et al., “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018

work page 2018

[37] [37]

Clipscore: A reference-free evaluation metric for image captioning,

Jack Hessel, Ari Holtzman, et al., “Clipscore: A reference-free evaluation metric for image captioning,” inEMNLPS, 2022

work page 2022

[38] [38]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”NeurIPS, 2017

work page 2017

[39] [39]

Improved tech- niques for training gans,

Tim Salimans, Ian Goodfellow, et al., “Improved tech- niques for training gans,”NeurIPS, 2016

work page 2016

[40] [40]

Effectively unbiased fid inception score and where to find them,

Min Chong and David Forsyth, “Effectively unbiased fid inception score and where to find them,” inCVPR, 2020

work page 2020

[41] [41]

Bleu: a method for automatic evaluation of machine translation,

Kishore Papineni, Salim Roukos, et al., “Bleu: a method for automatic evaluation of machine translation,” inACL, 2002

work page 2002

[42] [42]

Meteor: An auto- matic metric for mt evaluation with improved correlation with human judgments,

Satanjeev Banerjee and Alon Lavie, “Meteor: An auto- matic metric for mt evaluation with improved correlation with human judgments,” inACLw, 2005

work page 2005

[43] [43]

Rouge: A package for automatic evalu- ation of summaries,

Lin Chin-Yew, “Rouge: A package for automatic evalu- ation of summaries,” inTSBOw, 2004. APPENDIXA VIT INVERSION WITHMIMIC In addition to VLMs, we applyMIMICon Vision Trans- formers (ViTs) used on image classification objectives. For these experiments, we initialize an updatable input bvas in Sec. III. The synthesized image is then passed through the froz...

work page 2004

[44] [44]

All other hyperparameters and training settings remain unchanged

Implementation Details:We follow the same experi- mental configuration as in the main inversion setup, with two modifications: the optimization is performed for3000 iterations, and the scaling factorsα 1, α2, α3, β2, γ1,andγ 2 are set to1.0whileβ 1 is set to1×10 −4. All other hyperparameters and training settings remain unchanged

work page

[45] [45]

Results:As shown in Fig. B-1, replacing theℓ 2-based base feature loss with a KL-divergence formulation produces synthesized images that are less noisy and capture more distinctive semantic details, such as fine textures and clearer object boundaries (e.g., the fins of the goldfish, the body structure of the retriever, and the braided form of the pretzel)...

work page

[46] [46]

Templates used are shown in Fig

V arying target length:We evaluate the impact of [target]length using fixed templates with|ˆy| ∈ {4,5,7}. Templates used are shown in Fig. B-2 and the corresponding quantitative and qualitative results are shown in Table II and Figs. 5 and 6. Despite variation in length, the reconstructed images remain visually consistent, and the predicted outputs preser...

work page

[47] [47]

V arying target description:We further examine the effect of using more natural, free-form descriptions from the templates shown in Fig. B-3. Table B-II comparesdescriptions D1 (Fig. B-4a) andD 2 (Fig. B-4b) with differing lexical structures. We observe that text similarity metrics such as BLEU and ROUGE-L are slightly lower than those observed in the fix...

work page

[48] [48]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[49] [49]

Cats and dogs,

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar, “Cats and dogs,” in IEEE Xplore, 2012

work page 2012