Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning

Schrodi, S · 2024 · arXiv 2404.07983

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

cs.CV · 2026-05-05 · unverdicted · novelty 5.0 · 3 refs

VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.

citing papers explorer

Showing 3 of 3 citing papers.

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

cs.CV · 2026-05-12 · unverdicted · none · ref 29

LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · none · ref 15

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

cs.CV · 2026-05-05 · unverdicted · none · ref 39 · 3 links

VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.
