Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

Giuseppe Riccardi; Massimo Rizzoli; Seyed Mahed Mousavi; Simone Alghisi

arxiv: 2511.11440 · v3 · pith:A6G7TRL7new · submitted 2025-11-14 · 💻 cs.CV · cs.CL

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

classification cs.CV · cs.CL

keywords data · fine-tuning · performance · real-world · synthetic · annotation · distribution · generation

Abstract

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.
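
The abstract does not spell out how the controlled pipeline is built, so the following is only a minimal sketch of the general idea of balanced, label-by-construction stimuli for absolute position identification. It assumes a hypothetical 3x3 partition of the scene, a hypothetical 640x480 canvas, and a made-up helper named `make_balanced_stimuli`; none of these come from the paper, whose actual label set, image rendering, and sample counts may differ. The point it illustrates is that when positions are sampled per region rather than collected from real scenes, every region receives the same number of samples and each label follows directly from the sampled coordinates.

```python
import random
from collections import Counter

# Hypothetical 3x3 partition of the scene (an assumption for illustration;
# the paper's actual position labels may differ).
ROWS = ["top", "middle", "bottom"]
COLS = ["left", "center", "right"]


def make_balanced_stimuli(per_cell, width=640, height=480, seed=0):
    """Return (x, y, label) triples with exactly `per_cell` samples per grid cell."""
    rng = random.Random(seed)
    cell_w, cell_h = width / len(COLS), height / len(ROWS)
    samples = []
    for r, row in enumerate(ROWS):
        for c, col in enumerate(COLS):
            for _ in range(per_cell):
                # Draw the object centre uniformly inside the current cell,
                # so the label is known by construction.
                x = rng.uniform(c * cell_w, (c + 1) * cell_w)
                y = rng.uniform(r * cell_h, (r + 1) * cell_h)
                samples.append((x, y, f"{row} {col}"))
    rng.shuffle(samples)
    return samples


stimuli = make_balanced_stimuli(per_cell=10)
label_counts = Counter(label for _, _, label in stimuli)
```

In this sketch `label_counts` holds the same count for every region, which is the balance property the abstract contrasts with ad hoc real-world collection. Rendering an actual image with an object at each `(x, y)` is left out.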

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision
cs.RO 2026-04 unverdicted novelty 6.0

Vision-language models generate executable Behavior Tree policies for robots from synthetic vision-language data, with successful transfer demonstrated on two real manipulators.

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision
cs.RO 2026-04 unverdicted novelty 6.0

A 12B-parameter VLM learns to synthesize executable Behavior Tree policies from multimodal inputs via synthetic neuro-symbolic supervision, achieving zero-shot real-world transfer on robotic manipulators.