Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

Hongjian Zou; Qi Ding; Xiaoxin Chen; Yixuan Liao; Yue Ge

arXiv: 2604.13054 · v1 · submitted 2026-03-17 · 💻 cs.CL · cs.AI · cs.CV

Pith reviewed 2026-05-15 09:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV

keywords multimodal LLMs · knowledge density · scaling · VQA · image captions · semantic coverage

The pith

The bottleneck in multimodal scaling is knowledge density in training data rather than task format.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that multimodal large language models scale poorly because their training data lacks sufficient knowledge coverage, not because of insufficient task variety. It shows that visual question answering tasks add almost no new semantic information beyond what is already in image captions, as VQA answers can be predicted from captions alone with little performance drop. By instead enriching captions with structured knowledge and injecting cross-modal facts, the authors achieve consistent gains on multimodal and downstream tasks. Performance tracks semantic coverage more closely than the number of different tasks used. These results point to knowledge-centric data design as the key to better scaling in multimodal models.

Core claim

VQA supervision contributes little incremental semantic information beyond image captions, and increasing knowledge density through caption enrichment leads to consistent performance improvements, with performance correlating more strongly with semantic coverage than with task diversity.

What carries the argument

The reconstruction of VQA signals from captions to measure incremental information, combined with structured caption enrichment for knowledge injection.

If this is right

Structured caption enrichment improves performance across multimodal benchmarks.
Cross-modal knowledge injection yields gains without changing task format.
Semantic coverage is a stronger predictor of scaling success than task diversity.
Current MLLMs underperform primarily due to insufficient knowledge coverage in training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future work should prioritize data curation that maximizes knowledge density over collecting more diverse task formats.
This reframes multimodal scaling as a data quality problem similar to text-only LLM scaling.
Models trained on enriched captions may generalize better to unseen visual tasks that require factual recall.

Load-bearing premise

That VQA signals can be fully reconstructed from captions with negligible performance loss, implying captions already contain essentially all relevant semantic information.
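
One cheap way to probe this premise on a handful of examples is to check whether each VQA answer already appears in the paired caption. The sketch below is a crude lexical proxy, not the paper's reconstruction procedure: the function names and the example triples are hypothetical, and token overlap misses paraphrases (for instance "2" versus "two"), so a miss does not prove the caption lacks the information.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def answer_in_caption(caption, answer):
    """True when every token of the answer also occurs in the caption."""
    answer_tokens = tokens(answer)
    return bool(answer_tokens) and answer_tokens <= tokens(caption)

def recoverable_fraction(examples):
    """Share of (caption, question, answer) triples whose answer is in the caption."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(answer_in_caption(caption, answer) for caption, _question, answer in examples)
    return hits / len(examples)

# Hypothetical examples, made up for illustration only.
examples = [
    ("A red bus is parked next to two bicycles.", "What color is the bus?", "red"),
    ("A red bus is parked next to two bicycles.", "How many wheels does the bus have?", "six"),
]
print(recoverable_fraction(examples))
```

A high fraction on real data would be consistent with the premise but would not establish it, since the referee's concern about parametric knowledge applies to model-based reconstruction rather than to string matching.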

What would settle it

An experiment showing that adding VQA data without additional knowledge density still improves performance on held-out tasks where captions alone fail.

Figures

Figures reproduced from arXiv: 2604.13054 by Hongjian Zou, Qi Ding, Xiaoxin Chen, Yixuan Liao, Yue Ge.

**Figure 1.** Illustrative example showing how the semantic information required for answering the VQA question is already contained in the caption. The VQA pair primarily reorganizes this information into a question-answer format rather than introducing new semantic content.

**Figure 2.** Conceptual comparison between task-centric multimodal supervision and knowledge-centric training. Traditional pipelines increase task diversity through multiple supervision formats (e.g., captioning, VQA), while our approach increases semantic coverage by constructing knowledge-rich multimodal data.

**Figure 3.** Data construction pipeline for knowledge-centric multimodal training. An MLLM annotates semantic attributes for each image; images are then paired by semantic similarity and contrast, filtered for coherence, and captioned. Unpaired images follow a separate caption generation branch. Both streams yield a knowledge-rich corpus for MLLM training.

**Figure 4.** Comparison of knowledge density between training paradigms. Pair-Caption-v2 samples yield an average of 32 knowledge elements per sample, compared to 22 for standard VQACaption supervision, a 45% increase. Knowledge elements are extracted by prompting an LLM to identify objects, attributes, relationships, events, and categories present in each training sample.
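
The 45% figure in the Figure 4 caption follows from the two reported averages: (32 − 22) / 22 ≈ 0.455. A minimal Python sketch of the per-sample count and the relative increase is given here, assuming each sample's knowledge elements have already been extracted into a list of strings. The LLM extraction step is not reproduced, the function names are hypothetical, and whether duplicate elements are counted once or several times is not stated in the caption (this sketch counts them as given).

```python
from statistics import mean

def knowledge_density(samples):
    """Average number of knowledge elements per sample.

    `samples` is a list of lists of element strings (assumed input format).
    """
    return mean(len(elements) for elements in samples)

def relative_increase(new, base):
    """Relative change of `new` over `base`, as a fraction."""
    return (new - base) / base

# Averages reported in the Figure 4 caption.
print(f"{relative_increase(32, 22):.1%}")  # 45.5%
```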

Original abstract

Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density -- through structured caption enrichment and cross-modal knowledge injection -- leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments, performance correlates more strongly with semantic coverage than with task diversity. These findings suggest that current MLLMs fail to scale primarily because training data lacks sufficient knowledge coverage. We advocate for knowledge-centric multimodal training as a principled foundation for scalable multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows VQA can be mostly recovered from captions alone, but the claim that this proves knowledge density is the primary bottleneck rests on experiments that may let models use priors rather than data content.

The central result here is that training on enriched captions recovers most VQA performance without explicit question-answer pairs, and that adding structured knowledge to captions improves downstream scores more reliably than adding task variety. That observation is useful because it pushes back against the default habit of just scaling up task diversity in multimodal data mixes. The experiments on caption enrichment and cross-modal injection are straightforward and show consistent trends across benchmarks, which is the part worth taking seriously if the numbers hold under tighter controls.

What the work does cleanly is demonstrate that performance tracks semantic coverage more than the presence of VQA format itself. The reconstruction setup is the freshest piece: it directly tests whether VQA adds incremental signal beyond what captions already provide.

On the soft side, the reconstruction result could be inflated if the model is drawing on parametric knowledge or cross-task patterns learned earlier rather than extracting everything from the caption text alone. The abstract and available details do not spell out whether the evaluation isolates the caption content or allows the model to reason beyond it, which leaves the interpretation open. There is also no mention of statistical significance, exact data splits, or model-size controls, so it is hard to judge how robust the correlations are.

The paper is aimed at groups already running large-scale MLLM pretraining and looking for data-construction heuristics. It is worth a serious referee pass because the empirical angle is concrete and falsifiable, even if the causal story needs tighter experiments to pin down whether captions truly contain the full semantic load or whether the model is doing extra work. I would send it out rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that the primary bottleneck in multimodal LLM scaling is insufficient knowledge density in training data rather than task format. It supports this via experiments showing VQA signals can be reconstructed from image captions with negligible performance loss, followed by ablations demonstrating that structured caption enrichment and cross-modal knowledge injection yield consistent gains across benchmarks, with performance correlating more strongly with semantic coverage than task diversity.

Significance. If the central empirical claims hold after clarification, the work offers a useful reframing for multimodal training by prioritizing data enrichment over task diversity. This could inform more efficient scaling strategies and shift focus from format engineering to coverage metrics. The controlled ablations and reconstruction tests are a methodological strength that, if made fully verifiable, would strengthen the case for knowledge-centric approaches.

major comments (2)

[VQA reconstruction experiments] (described in the methods and results sections): the claim of negligible performance loss when reconstructing VQA from captions requires explicit clarification on whether the answering model is trained exclusively on caption data or leverages an already-trained MLLM's parametric knowledge and cross-task priors. If the latter, the result may reflect model inference rather than captions containing all task-relevant semantics, which is load-bearing for the central claim that knowledge density, not task format, is the bottleneck.
[Experimental setup and ablations] (results section): the manuscript lacks details on data splits, controls for model size, and statistical significance testing for the reported performance improvements from caption enrichment. Without these, it is difficult to confirm that gains are attributable to knowledge density rather than confounding factors, undermining verification of the correlation between semantic coverage and benchmark scores.

minor comments (2)

[Abstract] The terms 'knowledge density' and 'semantic coverage' are used without a precise operational definition or quantitative proxy (e.g., unique facts or entities), which could be clarified for readers.
[Results] The paper would benefit from a table summarizing the exact performance deltas and controls across the caption-only vs. enriched regimes to improve readability of the key comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help strengthen the clarity and verifiability of our work. We address each major comment point by point below and will incorporate revisions to improve the manuscript.

Referee: [VQA reconstruction experiments] (described in the methods and results sections): the claim of negligible performance loss when reconstructing VQA from captions requires explicit clarification on whether the answering model is trained exclusively on caption data or leverages an already-trained MLLM's parametric knowledge and cross-task priors. If the latter, the result may reflect model inference rather than captions containing all task-relevant semantics, which is load-bearing for the central claim that knowledge density, not task format, is the bottleneck.

Authors: We appreciate the referee's emphasis on this critical distinction. In the VQA reconstruction experiments, the answering model is trained exclusively on caption-derived data without access to pre-trained MLLM parametric knowledge or cross-task VQA priors; a base language model is fine-tuned solely on the enriched captions to perform the VQA task. This setup isolates the semantic coverage provided by captions, supporting our claim that task format adds negligible incremental information. We will revise the methods section to explicitly state the training protocol, including the absence of pre-trained multimodal priors, and include pseudocode for the reconstruction procedure to make the isolation fully transparent. revision: yes

Referee: [Experimental setup and ablations] (results section): the manuscript lacks details on data splits, controls for model size, and statistical significance testing for the reported performance improvements from caption enrichment. Without these, it is difficult to confirm that gains are attributable to knowledge density rather than confounding factors, undermining verification of the correlation between semantic coverage and benchmark scores.

Authors: We agree that these details are essential for rigorous verification. We will expand the experimental setup subsection to specify the exact data splits (e.g., train/validation/test ratios and how they were constructed to avoid leakage), confirm that all ablations use identical model sizes and architectures as controls, and report statistical significance via paired t-tests or bootstrap confidence intervals on the performance deltas. These additions will directly address potential confounds and reinforce the correlation between semantic coverage metrics and benchmark gains. revision: yes
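
Of the two significance checks named in this response, the bootstrap confidence interval is easy to sketch. The following is a minimal percentile bootstrap over paired per-item score differences, assuming each benchmark item has one score under the baseline regime and one under the enriched regime. The function name, argument names, and defaults are hypothetical, and nothing here reflects the authors' actual analysis code.

```python
import random
from statistics import mean

def paired_bootstrap_ci(baseline, enriched, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (enriched - baseline).

    Returns (observed_mean_delta, lower_bound, upper_bound).
    """
    if len(baseline) != len(enriched) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    deltas = [e - b for b, e in zip(baseline, enriched)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(deltas)
    # Resample the paired differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lower, upper
```

If the resulting interval excludes zero, the gain is unlikely to be resampling noise at the chosen level. Resampling paired differences rather than the two score lists separately keeps each item's baseline and enriched scores together, which is what makes the comparison paired.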

Circularity Check

0 steps flagged

No significant circularity

The paper's central claims rest on direct empirical comparisons: VQA signals reconstructed from captions yield negligible performance loss, and performance correlates more strongly with semantic coverage than task diversity. These are presented as experimental outcomes across controlled training regimes rather than derivations, fitted parameters renamed as predictions, or self-citation chains. No equations appear that reduce results to inputs by construction, and the abstract and described methods rely on observable benchmarks instead of definitional equivalence or ansatz smuggling. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central argument rests on the domain assumption that semantic coverage is adequately captured by caption reconstruction accuracy and that enrichment operations increase genuine knowledge density rather than superficial features.

axioms (1)

domain assumption: VQA signals can be reconstructed from image captions with negligible performance loss
This premise underpins the claim that task format is not the limiting factor.

pith-pipeline@v0.9.0 · 5480 in / 1101 out tokens · 63492 ms · 2026-05-15T09:34:54.336470+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1] All questions must be answerable with a single, clear, and unambiguous answer that can be directly derived from the image description, without relying on common-sense supplementation or implicit reasoning

[2] The only allowed information types are: -- explicitly mentioned objects -- directly observable attributes (such as color, shape, material, state, etc.) -- explicitly stated quantities, positional relationships, spatial relationships, or interaction relationships (e.g., “next to ...”, “in front of ...”)

[3] The following content is strictly prohibited: -- emotions, atmosphere, or aesthetic evaluations -- intentions, psychological states, or future behaviors -- uncertain expressions (such as “may,” “seems,” “speculates,” etc.)

[4] Add relevant content to make the entry more complete and quickly level up

For web pages or interface screenshots, do not answer questions about the following information by default, unless they themselves constitute the main content: -- top navigation bar or menu bar -- footer area -- registration information, copyright notices, or legal disclaimers -- auxiliary function areas, statistics, or decorative elements -- the “entry ...

[5] The total number of Q&A pairs should be controlled between 5 and 15, prioritizing quality over quantity

[6] Questions must include both global questions and detail-oriented questions, with detail-oriented questions significantly outnumbering global questions. [Global Question Requirements]

[7] Include at least one global question that prompts an overall description of the main content of the entire image

[8] The questions should be general and summarizing, with answers that can be understood as a concise version of the image description

[9] Answers should be one natural paragraph, shorter than the original description but significantly longer than a single-sentence summary, avoiding bullet-point or itemized format. [Detail Question Requirements]

[10] Detail questions should target specific and explicit details in the image description, such as: -- attributes, states, or quantities of individual objects -- clearly described spatial positions or relative relationships -- explicitly appearing interactions between people or objects

[11] Detail questions should aim to cover different key information points in the image description, without requiring exhaustive coverage of all details

[12] Detail questions should prioritize information that is valuable for understanding the image content and avoid unnecessary or low training-value details. For page-like images, focus on the main text or core information rather than layout or decorative elements

[13] Different detail questions should correspond to different information points or attribute dimensions, avoiding repetition or high overlap

[14] Answers to detail questions must be accurate and concise, mainly consisting of nouns, quantities, or explicit relationships, without additional explanation, modification, or inference

[15] Avoid generating the following types of detail questions: -- repetitive or highly similar questions -- low-value questions generated merely to increase quantity [Output Format] Please output a JSON list in which each element contains “question” and “answer” fields: [{"question": "specific question", "answer": "corresponding answer"}] Please return only ...

[16] Core Objective: Extract key information at the conceptual, categorical, and functional levels from the image, to support the construction of a semantically related image database

[17] General Rules: (1) Understand first, then abstract: Fully understand the image content (including any text), then summarize and abstract concrete content into concepts, categories, and functions. (2) Concept first: Always focus on: -- what it is (theme/object) -- what it is about (concept/knowledge) -- why it exists (function/purpose) (3) Avoid surface fe...

[18] hierarchical semantic information

Fields and Instructions: (1) Type Theme (required): The macro-level presentation form. Examples: Nature and Environment; People and Society; Virtual and Art; Academic Problem; Knowledge Introduction/Document; Screenshot/Interface; Others. (2) Domain Direction (required): ...

[19] Focus on the core; ignore non-essential elements: The description must focus entirely on the primary content that conveys core information, such as main text, titles, key data, central arguments, or critical facts. For webpage or interface screenshots, strictly ign...

[20] Detailed informational grounding: -- Provide detailed transcription of key textual information; avoid vague expressions such as “for example” or “etc.” -- Clearly describe logical relationships among core elements, without referencing container-like section names

[21] Objective description: Only describe observable facts. Do not speculate or evaluate. [Caption Generation Structure Requirements]

[22] High-level overview: Identify the macro-level theme, field, or concept shared by both images

[23] Description of Image A: Introduce naturally (e.g., “One of the materials involves...”) and provide a complete and detailed description of: -- core subject or theme -- key textual content -- main aspects of the content -- (if applicable) its role or significance

[24] Description of Image B: Introduce similarly (e.g., “The other material involves...”) and provide an equally detailed description

[25] Comparative and relational analysis: -- Core similarities: identify shared subject matter or conceptual domain -- Key differences: analyze differences in subject, context, roles, themes, or audience, grounded in content -- Complementarity: explain how differences together form a richer structure or understanding

[26] Conclusion: Briefly state the significance of the comparison for understanding the broader theme. [Language and Output] -- Use objective, concise, and information-dense English -- Do not refer to “first image” or “second image” -- Do not mention layout, colors, or formatting -- Output a single coherent paragraph only Note: Output only the generated ca...

[27] Fact: -- Must be explicitly stated in the text; no inference is allowed -- Do not include any attributes, objects, or scenes that are not mentioned in the text -- Each Fact should be a JSON object with: "fact": atomic fact (short description) "level": "L1"

[28] Abstract: -- Must be explicitly expressed in the text -- Do not add any emotions, atmosphere, or interpretations that are not present in the text -- Do not generate inferred abstract information -- Output as a list of short sentences Important Constraints: -- If an...
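
Entries [5] and [15] above state two checkable constraints on generated Q&A output: a JSON list of objects with "question" and "answer" fields, containing between 5 and 15 pairs. A minimal validation sketch under those two constraints only is given here. The function name and the error messages are hypothetical, and the remaining prompt rules (for example, the global versus detail question balance) are not checked.

```python
import json

def validate_qa_output(raw, min_pairs=5, max_pairs=15):
    """Return a list of problems found; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return ["top-level value is not a list"]
    problems = []
    if not min_pairs <= len(data) <= max_pairs:
        problems.append(f"expected {min_pairs}-{max_pairs} pairs, got {len(data)}")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        # Both fields must be present and contain non-blank strings.
        for field in ("question", "answer"):
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {i}: missing or empty '{field}'")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every defect in a single generated sample before deciding whether to discard it.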
