DiffSpot benchmark shows frontier VLMs detect only 40.7% of single-CSS-property changes in web interfaces, with hard-tier recall below 23% and difficulty varying by property rather than pixel magnitude.
conservative but competent
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CV (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?
DiffSpot benchmark shows frontier VLMs detect only 40.7% of single-CSS-property changes in web interfaces, with hard-tier recall below 23% and difficulty varying by property rather than pixel magnitude.