PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
Pith reviewed 2026-05-25 04:38 UTC · model grok-4.3
The pith
Overlaying unambiguous geometric primitives on images supplies dense supervision that improves visual grounding in MLLMs and produces gains of up to 20% on spatial benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PGT generates new training examples by overlaying unambiguous geometric primitives on images, creating dense supervision that disentangles visual grounding capability from semantic priors. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data yields improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. Finetuning state-of-the-art MLLMs on PGT data produces additional boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D, with gains observed across diverse architectures on relational, quantitative, and 3D understanding tasks.
What carries the argument
Procedurally Generated Tasks (PGT), a data generation framework that overlays unambiguous geometric primitives on images to produce dense visual supervision signals.
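A minimal sketch of what such a generator might look like, under loud assumptions: the image here is a plain grid of RGB tuples standing in for a natural photo, and the shape vocabulary, the color set, the helper names (`blank_image`, `draw_primitive`, `make_example`), and the question template are all hypothetical illustrations rather than the paper's implementation. The point it shows is that the answer is computed from the sampled coordinates, so the label cannot be guessed from semantic priors about the scene.

```python
import random

# Hypothetical primitive vocabulary; the paper's actual shapes and colors are not given here.
COLORS = {"red": (255, 0, 0), "blue": (0, 0, 255), "green": (0, 160, 0)}
SHAPES = ("circle", "square")


def blank_image(width, height, fill=(128, 128, 128)):
    """Stand-in for a natural image: a height x width grid of RGB tuples."""
    return [[fill for _ in range(width)] for _ in range(height)]


def draw_primitive(image, shape, color, cx, cy, radius):
    """Overlay a filled circle or square centered at (cx, cy), modifying image in place."""
    height, width = len(image), len(image[0])
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            inside = shape == "square" or (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            if inside:
                image[y][x] = COLORS[color]


def make_example(image, rng, radius=8):
    """Place two differently colored primitives and derive a left/right QA pair."""
    height, width = len(image), len(image[0])
    color_a, color_b = rng.sample(sorted(COLORS), 2)  # distinct colors keep references unambiguous
    shape_a, shape_b = rng.choice(SHAPES), rng.choice(SHAPES)
    # One primitive per image half, so the two never overlap horizontally.
    xs = [rng.randint(radius, width // 2 - radius - 1),
          rng.randint(width // 2 + radius + 1, width - radius - 1)]
    rng.shuffle(xs)
    ya, yb = (rng.randint(radius, height - radius - 1) for _ in range(2))
    draw_primitive(image, shape_a, color_a, xs[0], ya, radius)
    draw_primitive(image, shape_b, color_b, xs[1], yb, radius)
    question = f"Is the {color_a} {shape_a} to the left of the {color_b} {shape_b}?"
    answer = "yes" if xs[0] < xs[1] else "no"  # label comes from geometry alone
    return {"question": question, "answer": answer,
            "primitives": [(shape_a, color_a, xs[0], ya), (shape_b, color_b, xs[1], yb)]}


example = make_example(blank_image(64, 48), random.Random(0))
print(example["question"], "->", example["answer"])
```

In practice the grid would be replaced by a real image and an imaging library's drawing calls; the coordinate-derived labeling is the part this sketch is meant to convey.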
If this is right
- Instruction tuning on PGT-augmented data improves performance on relational, quantitative, and 3D/depth understanding benchmarks.
- The gains occur while general perception capabilities remain intact.
- Finetuning already strong MLLMs on PGT data produces further measurable lifts on the same spatial tasks.
- PGT functions as both an improvement method and a diagnostic tool for locating perception failures.
Where Pith is reading between the lines
- If the supervision signal is the main bottleneck, then similar procedural overlays could be applied to other perception tasks such as object counting or attribute binding.
- The approach implies that scaling the volume of geometrically annotated data might continue to raise spatial performance even for larger models.
- PGT-style diagnostics could be used to test whether other MLLM deficits, such as hallucination on visual attributes, also reduce when unambiguous visual cues are added during training.
Load-bearing premise
Overlaying unambiguous geometric primitives supplies supervision that truly disentangles visual grounding from semantic priors and that inadequate supervision is the primary source of spatial reasoning deficits rather than model capacity or resolution limits.
What would settle it
Training an MLLM on PGT-augmented data and then measuring zero improvement on the What'sUp benchmark when the geometric overlays are removed from test images would show that the claimed disentanglement and generalization do not hold.
Original abstract
Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Procedurally Generated Tasks (PGT), a framework that overlays unambiguous geometric primitives on images to create additional dense supervision for fine-grained visual understanding in MLLMs. PGT is presented as both an augmentation method for instruction tuning and a diagnostic tool. The central empirical claim is that augmenting LLaVA-v1.5-Instruct with PGT data yields gains of up to +20% on What'sUp and +13.3% on CV-Bench-2D, with smaller gains (+5.5% and +8.3%) when finetuning SOTA models, while preserving general perception capabilities. The authors conclude that many spatial reasoning deficits arise from inadequate supervision rather than architectural or resolution limitations.
Significance. If the central interpretation holds after appropriate controls, the work would be significant for multimodal learning research. It offers a low-cost, procedurally generated data augmentation strategy that targets visual grounding without altering model architecture or resolution. The dual diagnostic role could help isolate perception failures. Demonstrating gains across architectures and no degradation in general capabilities strengthens the practical value. The emphasis on supervision as the primary bottleneck could redirect attention toward targeted data curation for spatial tasks in MLLMs.
major comments (2)
- [§4 (Experiments), Table 2] The headline gains (+20% on What'sUp, +13.3% on CV-Bench-2D) compare LLaVA-v1.5-Instruct to the same model augmented with PGT data, but no ablation holds total training data volume fixed by adding an equivalent number of non-geometric synthetic examples. This control is required to substantiate that the geometric primitives supply a qualitatively different signal that disentangles visual grounding from semantic priors, rather than the gains arising from generic data-augmentation effects.
- [§4.3 (Ablations) and §5 (Conclusion)] The claim that spatial deficits 'stem from inadequate supervision signals rather than inherent architectural or resolution limitations' is not directly tested by experiments that vary model scale or input resolution while keeping the original training mixture fixed. Without such tests, the interpretation that PGT addresses the primary source of the observed deficits remains under-supported.
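The volume-matched control requested in the first major comment could be set up roughly as follows. This is a sketch under stated assumptions: `build_mixtures`, the mixture names, and the idea of sampling an equal number of examples from a PGT pool and a non-geometric synthetic pool are illustrative, not the authors' protocol. What it shows is that both augmented mixtures end up the same size, so any gap between them cannot be attributed to data volume.

```python
import random


def build_mixtures(base, pgt, control, n_extra, seed=0):
    """Return three training mixtures; the two augmented ones have identical size.

    base    -- original instruction-tuning examples
    pgt     -- pool of geometric-overlay examples
    control -- pool of non-geometric synthetic examples (hypothetical control)
    n_extra -- number of examples added to each augmented mixture
    """
    if n_extra > min(len(pgt), len(control)):
        raise ValueError("n_extra exceeds the smaller augmentation pool")
    rng = random.Random(seed)
    pgt_subset = rng.sample(pgt, n_extra)
    control_subset = rng.sample(control, n_extra)
    return {
        "base_only": list(base),
        "base_plus_pgt": list(base) + pgt_subset,
        "base_plus_control": list(base) + control_subset,
    }


# Toy usage with placeholder example IDs.
mixtures = build_mixtures(
    base=[f"b{i}" for i in range(10)],
    pgt=[f"p{i}" for i in range(6)],
    control=[f"c{i}" for i in range(8)],
    n_extra=5,
)
print({name: len(items) for name, items in mixtures.items()})
```

Training the same model on each mixture and comparing `base_plus_pgt` against `base_plus_control` would then isolate the effect of the geometric primitives from the effect of simply adding more synthetic data.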
minor comments (2)
- [Abstract] The claim of 'extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks' is not accompanied by an explicit list of all evaluated benchmarks beyond What'sUp and CV-Bench-2D; adding this would improve transparency.
- [§3 (PGT Framework)] The procedural generation process for the geometric primitives and associated QA pairs could include pseudocode or additional implementation details to facilitate reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below, providing clarifications and indicating where revisions will be made.
Point-by-point responses
- Referee: [§4 (Experiments), Table 2] The headline gains (+20% on What'sUp, +13.3% on CV-Bench-2D) compare LLaVA-v1.5-Instruct to the same model augmented with PGT data, but no ablation holds total training data volume fixed by adding an equivalent number of non-geometric synthetic examples. This control is required to substantiate that the geometric primitives supply a qualitatively different signal that disentangles visual grounding from semantic priors, rather than the gains arising from generic data-augmentation effects.
  Authors: We agree that an ablation holding total training data volume fixed with an equivalent number of non-geometric synthetic examples would strengthen the evidence that the geometric primitives provide a distinct signal. Our existing ablations in §4.3 compare PGT against other augmentation strategies and show that gains are specific to the geometric overlays. We will incorporate this additional control experiment in the revised manuscript to directly address the concern.
  Revision: yes
- Referee: [§4.3 (Ablations) and §5 (Conclusion)] The claim that spatial deficits 'stem from inadequate supervision signals rather than inherent architectural or resolution limitations' is not directly tested by experiments that vary model scale or input resolution while keeping the original training mixture fixed. Without such tests, the interpretation that PGT addresses the primary source of the observed deficits remains under-supported.
  Authors: The experiments demonstrate consistent gains from PGT across fixed architectures and input resolutions, indicating that supervision can substantially mitigate the observed spatial deficits. We acknowledge that experiments varying model scale or resolution while holding the training mixture fixed would provide more direct support for the interpretation. We will revise the language in §5 and the conclusion to more precisely frame the findings as evidence that inadequate supervision is a key addressable factor, rather than asserting it as the sole primary source.
  Revision: partial
Circularity Check
Empirical augmentation study with no circular derivations or self-referential predictions
The paper is an empirical study that augments training data with procedurally generated geometric overlays and reports benchmark gains on external tasks (What'sUp, CV-Bench-2D). No equations, first-principles derivations, or predictions appear in the provided text. No self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are present. Results are measured against held-out benchmarks and therefore remain falsifiable outside the training mixture. This is the normal non-circular outcome for a data-augmentation paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Additional dense visual supervision from unambiguous primitives will improve grounding without harming general capabilities.