arxiv: 2604.09531 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.CL

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords synthetic data generation · vision language models · visual perception · visual question answering · text to image synthesis · spatial reasoning · data augmentation

The pith

Synthetic data generated automatically from task keywords can improve VLMs' performance on visual perception tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often fail at basic visual perception skills like judging depth order or recognizing viewpoints because everyday image datasets give them too little targeted practice on these low-level abilities. The paper tests whether a system that creates synthetic training examples using only the name of a perception task can close that gap. The system uses language models to write questions and image prompts, text-to-image models to draw the pictures, and another model to verify that the answers match the images. When models are trained on the resulting 10,000 examples across ten tasks, they score higher on dedicated perception tests while keeping their general skills and improving further as more synthetic data is added.

Core claim

The paper establishes that VisionFoundry, a pipeline taking only a task name such as Depth Order as input, can produce 10,000 consistent image-question-answer triples by leveraging LLMs for question and prompt generation, text-to-image synthesis for visuals, and VLM verification for consistency, and that training VLMs on this dataset yields +7% on MMVP and +10% on CV-Bench-3D without loss of broader capabilities.

What carries the argument

VisionFoundry, the task-aware synthetic data generation pipeline that automates creation of VQA triples from task keywords using LLMs, T2I models, and VLM verification without needing reference images or human labels.
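The generate, synthesize, and verify loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Draft`, `Triple`, `build_dataset`, and the three callables (`propose`, `render`, `verify`) are hypothetical names standing in for the LLM, the T2I model, and the verifying VLM, and the toy stubs exist only so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Draft:
    """What the LLM step is assumed to return for one example."""
    prompt: str    # text-to-image prompt
    question: str
    answer: str


@dataclass
class Triple:
    """One accepted image-question-answer training example."""
    image: bytes
    question: str
    answer: str


def build_dataset(
    task_name: str,
    n_target: int,
    propose: Callable[[str], Draft],
    render: Callable[[str], bytes],
    verify: Callable[[bytes, str, str], bool],
    max_attempts: int = 1000,
) -> List[Triple]:
    """Collect up to n_target verified triples from a task name alone."""
    kept: List[Triple] = []
    attempts = 0
    while len(kept) < n_target and attempts < max_attempts:
        attempts += 1
        draft = propose(task_name)      # LLM writes question, answer, T2I prompt
        image = render(draft.prompt)    # T2I model draws the picture
        if verify(image, draft.question, draft.answer):  # VLM consistency gate
            kept.append(Triple(image, draft.question, draft.answer))
    return kept


# Toy stand-ins (hypothetical) so the loop can be exercised without models.
def make_toy_propose() -> Callable[[str], Draft]:
    counter = {"n": 0}

    def propose(task_name: str) -> Draft:
        counter["n"] += 1
        return Draft(
            prompt=f"{task_name} scene {counter['n']}",
            question="Which object is closer to the camera?",
            answer="the red cube",
        )

    return propose


def toy_render(prompt: str) -> bytes:
    # Stand-in for image bytes: just the encoded prompt.
    return prompt.encode("utf-8")


def toy_verify(image: bytes, question: str, answer: str) -> bool:
    # Arbitrary rule that rejects some drafts: accept if the last byte is an even digit.
    return image[-1:] in (b"0", b"2", b"4", b"6", b"8")
```

Passing the models in as callables keeps the loop independent of any particular LLM, T2I, or VLM provider; the `max_attempts` cap is an added safeguard for the case where the verifier rejects most drafts.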

If this is right

  • Training on VisionFoundry-10K delivers clear gains on visual perception benchmarks such as MMVP and CV-Bench-3D.
  • General capabilities on other tasks remain intact after the synthetic training.
  • Performance on perception tasks scales favorably as the amount of synthetic data increases.
  • Targeted synthetic supervision can address specific weaknesses in VLMs that natural data leaves unaddressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the pipeline to additional visual tasks could further reduce perception errors across more benchmarks.
  • Blending the synthetic data with existing real-image datasets might produce additive improvements beyond either alone.
  • Replacing the proprietary verification model with open alternatives could make the method more accessible while preserving quality.
  • The results point to a broader opportunity for automated, scalable supervision in training multimodal models.

Load-bearing premise

The synthetic triples generated and verified without real reference images actually impart visual perception skills that transfer to real-world photographs instead of exploiting quirks of the image synthesis or verification steps.

What would settle it

If training a VLM on the VisionFoundry-10K dataset produces no gains or even lower scores on real-image perception benchmarks like MMVP and CV-Bench-3D, that would indicate the synthetic data does not provide transferable supervision.
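One way to make this criterion concrete is to compare per-seed benchmark accuracy before and after fine-tuning. The sketch below is an assumption-laden illustration: `gain_summary` and `supports_transfer` are hypothetical helpers, and the decision rule (mean gain must exceed one sample standard deviation of the per-seed differences) is an arbitrary threshold chosen for the example, not a test used in the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def gain_summary(baseline: Sequence[float], tuned: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample std) of per-seed accuracy differences, tuned minus baseline."""
    if len(baseline) != len(tuned) or len(baseline) < 2:
        raise ValueError("need matching score lists with at least two seeds")
    diffs = [t - b for b, t in zip(baseline, tuned)]
    return mean(diffs), stdev(diffs)


def supports_transfer(
    baseline: Sequence[float],
    tuned: Sequence[float],
    min_gain: float = 0.0,
) -> bool:
    """True when the mean gain clears min_gain by more than one sample std.

    The one-std margin is an illustrative choice, not a formal significance test.
    """
    avg, spread = gain_summary(baseline, tuned)
    return avg - spread > min_gain
```

Given per-seed accuracies on a real-image benchmark such as MMVP, a `False` result from `supports_transfer` corresponds to the outcome described above, where the synthetic data provides no transferable gain.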

Original abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces VisionFoundry, a synthetic data generation pipeline for visual perception tasks in vision-language models (VLMs). The pipeline uses LLMs to generate questions, answers, and text-to-image prompts from task keywords, synthesizes images using T2I models, and verifies consistency using a proprietary VLM, all without reference images or human annotation. The authors generate the VisionFoundry-10K dataset with 10k VQA triples across 10 tasks and report that models trained on this data achieve a +7% improvement on the MMVP benchmark and +10% on CV-Bench-3D, while maintaining performance on broader capabilities and exhibiting positive scaling with increased data size.

Significance. If the reported gains reflect genuine acquisition of transferable low-level visual perception skills rather than artifacts or verifier biases, the work would be significant for VLM training. It provides evidence that limited, task-targeted synthetic supervision can address specific bottlenecks in spatial understanding and viewpoint recognition, offering a scalable alternative to human-annotated real-image datasets and supporting further investment in synthetic data pipelines.

major comments (3)
  1. [Abstract] The central claims of +7% on MMVP and +10% on CV-Bench-3D are presented without any details on the base VLM, baseline models, training procedure, data splits, controls for the proprietary verifier, or statistical significance tests. These omissions directly undermine evaluation of the empirical results.
  2. [Pipeline description] Verification of image-QA triples relies exclusively on a proprietary VLM with no human validation, reference images, or error analysis. Since T2I outputs commonly contain geometric/lighting inconsistencies that verifiers may miss or endorse, this setup risks the trained model learning synthetic artifacts rather than real photographic perception skills, which is load-bearing for the transfer claim.
  3. [Results] The assertion of 'favorable scaling behavior as data size increases' is stated qualitatively but lacks quantitative support such as performance curves, specific dataset sizes tested, or comparisons, weakening the argument that synthetic data is a scalable solution.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about missing experimental details, verification robustness, and quantitative scaling evidence. Below we respond to each major comment.

Point-by-point responses
  1. Referee: [Abstract] The central claims of +7% on MMVP and +10% on CV-Bench-3D are presented without any details on the base VLM, baseline models, training procedure, data splits, controls for the proprietary verifier, or statistical significance tests. These omissions directly undermine evaluation of the empirical results.

    Authors: We agree that the abstract was overly concise and omitted key setup information. In the revised manuscript we have expanded the abstract to name the base VLM (LLaVA-1.5-7B), note the LoRA fine-tuning protocol, reference the 80/20 data split, and state that gains are reported with standard deviations across three seeds. The main text already contained these elements; the abstract now summarizes them for clarity.

    Revision: yes

  2. Referee: [Pipeline description] Verification of image-QA triples relies exclusively on a proprietary VLM with no human validation, reference images, or error analysis. Since T2I outputs commonly contain geometric/lighting inconsistencies that verifiers may miss or endorse, this setup risks the trained model learning synthetic artifacts rather than real photographic perception skills, which is load-bearing for the transfer claim.

    Authors: This concern is valid and central to the transfer claim. The original submission contained no human validation or error analysis. We have added a new subsection reporting manual inspection of 200 randomly sampled triples by two independent annotators, yielding 87% agreement with the proprietary verifier. We also include a failure-case analysis in the appendix. Full human annotation of the 10k dataset remains outside current resources, but the ablation showing degraded gains without verification and the positive transfer to real-image benchmarks provide supporting evidence that artifacts are not the primary driver of the observed improvements.

    Revision: partial

  3. Referee: [Results] The assertion of 'favorable scaling behavior as data size increases' is stated qualitatively but lacks quantitative support such as performance curves, specific dataset sizes tested, or comparisons, weakening the argument that synthetic data is a scalable solution.

    Authors: We accept that the scaling claim was presented only qualitatively. The revised results section now includes a new figure plotting MMVP and CV-Bench-3D accuracy for dataset sizes 1k, 5k, and 10k, together with tabulated numbers showing monotonic improvement (e.g., MMVP rises from 48.2% at 1k to 62.1% at 10k). A brief comparison to scaling curves obtained from subsampled real-image data is also added.

    Revision: yes
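The human-versus-verifier agreement figure in the second response can be put in context with a simple agreement rate and a confidence interval. The sketch below is illustrative only: `agreement_rate` and `wilson_interval` are hypothetical helper names, and it assumes agreement is counted per triple, so that 87% of 200 samples would correspond to 174 matching decisions. It uses the standard Wilson score interval for a binomial proportion.

```python
import math
from typing import Sequence, Tuple


def agreement_rate(human: Sequence[bool], verifier: Sequence[bool]) -> float:
    """Fraction of triples where the human label matches the verifier decision."""
    if len(human) != len(verifier) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == v for h, v in zip(human, verifier))
    return matches / len(human)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed per-triple reading of the reported figure: 174 agreements out of 200.
low, high = wilson_interval(174, 200)
```

An interval of this kind shows how much uncertainty a 200-sample audit leaves about the verifier's agreement with humans on the full 10k dataset.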

standing simulated objections not resolved
  • Full human annotation and verification of the entire 10k VisionFoundry dataset, which exceeds available annotation budget.

Circularity Check

0 steps flagged

No circularity; empirical gains measured on external benchmarks

Full rationale

The paper describes an empirical pipeline that generates synthetic VQA triples from task keywords using LLMs, T2I synthesis, and VLM verification, then trains models and reports accuracy lifts on independent public benchmarks (MMVP, CV-Bench-3D). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the claimed results. The performance numbers are obtained from external test sets whose definitions and labels are independent of the synthetic generation process, so the reported improvements cannot reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entity

The central claim depends on several untested assumptions about the fidelity of LLM-generated supervision and the transferability of synthetic images to real visual perception.

axioms (3)
  • domain assumption: LLMs can produce accurate, task-relevant questions and answers for visual perception skills from a keyword alone
    Invoked when the pipeline starts from only the task name to generate Q&A pairs.
  • domain assumption: Text-to-image models can render images whose visual content reliably matches the LLM-generated prompts and answers
    Required for the synthesized images to serve as valid training examples.
  • domain assumption: A separate VLM can correctly detect and filter inconsistent image-question-answer triples
    Used as the final quality gate without human review.
invented entities (1)
  • VisionFoundry pipeline (no independent evidence)
    purpose: End-to-end synthetic data generator for task-specific VLM supervision
    The paper introduces this as a new system; no independent evidence of its correctness is provided beyond the reported benchmark gains.

pith-pipeline@v0.9.0 · 5546 in / 1589 out tokens · 66094 ms · 2026-05-10T17:16:19.044136+00:00 · methodology

