Knowledge density in image captions, not task format diversity, is the primary driver of multimodal LLM scaling performance.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling