Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

fields: cs.CV 3

years: 2026 3

verdicts: UNVERDICTED 3

roles: background 2

polarities: background 2

representative citing papers

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

cs.CV · 2026-05-05 · unverdicted · novelty 5.0 · 3 refs

VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.

citing papers explorer

Showing 3 of 3 citing papers.

  • Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

    cs.CV · 2026-05-12 · unverdicted · none · ref 29

    LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.

  • Counting to Four is still a Chore for VLMs

    cs.CV · 2026-04-11 · unverdicted · none · ref 15

    VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

  • VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV · 2026-05-05 · unverdicted · none · ref 39 · 3 links

    VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended settings.