Klaus Krippendorff
2 Pith papers cite this work.
Years: 2026 (2)
Representative citing papers
A compliance-scored best-of-N orchestration layer for multimodal document generation reports 91% compliance at 5 attempts in 20 seconds and +11 percentage point win rate gains in aggregate operational data for payments dispute defense.
citing papers explorer
- Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark
A 540-image benchmark with four phrasing variants per image reveals VLMs degrade when text leakage is minimized, with no-image ablations confirming reliance and GRPO post-training yielding gains that transfer to held-out data.