In- terpreting the linear structure of vision-language model embedding spaces

Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil · 2025 · arXiv 2504.11695

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.

The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.

citing papers explorer

Showing 3 of 3 citing papers.

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models cs.LG · 2026-05-07 · unverdicted · none · ref 39
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs cs.CV · 2026-05-11 · unverdicted · none · ref 30
Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 14
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.

In- terpreting the linear structure of vision-language model embedding spaces

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer