arxiv: 2605.23883 · v1 · pith:W75NYE45new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Pith reviewed 2026-05-25 04:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal large language models · visual grounding · spatial reasoning · procedural generation · instruction tuning · fine-grained perception · geometric primitives

The pith

Overlaying unambiguous geometric primitives on images supplies dense supervision that improves visual grounding in MLLMs and produces gains of up to 20% on spatial benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Procedurally Generated Tasks to create additional training data by placing clear geometric shapes and lines on existing images. This produces supervision signals that separate precise visual attention from language-based priors. When added to standard instruction-tuning sets, the data lifts results on relational and depth benchmarks while leaving general perception performance unchanged. The central argument is that many observed shortfalls in fine-grained spatial understanding trace to insufficient training signals rather than limits in model size or image resolution.

Core claim

PGT generates new training examples by overlaying unambiguous geometric primitives on images, creating dense supervision that disentangles visual grounding capability from semantic priors. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data yields improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. Finetuning state-of-the-art MLLMs on PGT data produces additional boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D, with gains observed across diverse architectures on relational, quantitative, and 3D understanding tasks.

What carries the argument

Procedurally Generated Tasks (PGT), a data generation framework that overlays unambiguous geometric primitives on images to produce dense visual supervision signals.
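
A minimal sketch of what one such procedurally generated task could look like, assuming a left/right spatial-relation question over two overlaid primitives. Every name here (`Primitive`, `sample_pair`, `left_right_qa`), the shape and color vocabularies, and the size and gap values are illustrative assumptions rather than the authors' implementation. Rendering the primitives onto the image is omitted; only the placement and the derived question-answer pair are shown.

```python
import random
from dataclasses import dataclass

# Hypothetical vocabularies; the paper's actual primitive set is not given here.
SHAPES = ("circle", "square", "triangle")
COLORS = ("red", "green", "blue", "yellow")


@dataclass(frozen=True)
class Primitive:
    shape: str
    color: str
    cx: int    # center x in pixels
    cy: int    # center y in pixels
    size: int  # half-extent in pixels

    @property
    def name(self) -> str:
        return f"{self.color} {self.shape}"


def sample_pair(width, height, rng, size=24, min_gap=16):
    """Sample two distinguishable primitives with a clear horizontal gap."""
    if width < 4 * size + min_gap or height < 2 * size:
        raise ValueError("image too small for two separated primitives")
    # Distinct (shape, color) pairs so the question refers to each unambiguously.
    (s1, c1), (s2, c2) = rng.sample([(s, c) for s in SHAPES for c in COLORS], 2)
    while True:
        x1 = rng.randint(size, width - size)
        x2 = rng.randint(size, width - size)
        # Reject overlapping or near-touching placements: the relation must be unambiguous.
        if abs(x1 - x2) >= 2 * size + min_gap:
            break
    y1 = rng.randint(size, height - size)
    y2 = rng.randint(size, height - size)
    return Primitive(s1, c1, x1, y1, size), Primitive(s2, c2, x2, y2, size)


def left_right_qa(a, b):
    """Derive the answer from coordinates alone, so no semantic prior can help."""
    answer = "left" if a.cx < b.cx else "right"
    question = f"Is the {a.name} to the left or to the right of the {b.name}?"
    return {"question": question, "answer": answer}


rng = random.Random(0)
a, b = sample_pair(336, 336, rng)
print(left_right_qa(a, b))
```

Because the label is computed from the sampled coordinates, the answer cannot be guessed from what objects usually co-occur in the underlying image, which is the property the paper attributes to PGT supervision.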

If this is right

  • Instruction tuning on PGT-augmented data improves performance on relational, quantitative, and 3D/depth understanding benchmarks.
  • The gains occur while general perception capabilities remain intact.
  • Finetuning already strong MLLMs on PGT data produces further measurable lifts on the same spatial tasks.
  • PGT functions as both an improvement method and a diagnostic tool for locating perception failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the supervision signal is the main bottleneck, then similar procedural overlays could be applied to other perception tasks such as object counting or attribute binding.
  • The approach implies that scaling the volume of geometrically annotated data might continue to raise spatial performance even for larger models.
  • PGT-style diagnostics could be used to test whether other MLLM deficits, such as hallucination on visual attributes, also reduce when unambiguous visual cues are added during training.

Load-bearing premise

The argument assumes two things: that overlaying unambiguous geometric primitives supplies supervision that truly disentangles visual grounding from semantic priors, and that inadequate supervision, rather than model capacity or resolution limits, is the primary source of spatial reasoning deficits.

What would settle it

Training an MLLM on PGT-augmented data and then measuring zero improvement on the What'sUp benchmark when the test images carry no geometric overlays would show that the claimed disentanglement and generalization do not hold.

Figures

Figures reproduced from arXiv: 2605.23883 by Adriana Romero-Soriano, Amir Bar, Michal Drozdzal, Rim Assouel.

Figure 1. Overview of PGT. Top: The construction of our procedurally generated data to augment instruction tuning training datasets. Abstract geometric primitives are overlaid to training data, when available. Bottom: (Left) Examples of failure modes in fine-grained relational and spatial understanding of state-of-the-art MLLMs. In the first example the model can rely on the fact that a bowl is usually on a table an…
Figure 2. Our suite of PGT: (left) spatial relationship reasoning, (center) abstract counting, and (right) 2D relative distance estimation.
Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generate additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively address the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Procedurally Generated Tasks (PGT), a framework that overlays unambiguous geometric primitives on images to create additional dense supervision for fine-grained visual understanding in MLLMs. PGT is presented as both an augmentation method for instruction tuning and a diagnostic tool. The central empirical claim is that augmenting LLaVA-v1.5-Instruct with PGT data yields gains of up to +20% on What'sUp and +13.3% on CV-Bench-2D, with smaller gains (+5.5% and +8.3%) when finetuning SOTA models, while preserving general perception capabilities. The authors conclude that many spatial reasoning deficits arise from inadequate supervision rather than architectural or resolution limitations.

Significance. If the central interpretation holds after appropriate controls, the work would be significant for multimodal learning research. It offers a low-cost, procedurally generated data augmentation strategy that targets visual grounding without altering model architecture or resolution. The dual diagnostic role could help isolate perception failures. Demonstrating gains across architectures and no degradation in general capabilities strengthens the practical value. The emphasis on supervision as the primary bottleneck could redirect attention toward targeted data curation for spatial tasks in MLLMs.

major comments (2)
  1. [§4 (Experiments), Table 2] The headline gains (+20% on What'sUp, +13.3% on CV-Bench-2D) compare LLaVA-v1.5-Instruct to the same model augmented with PGT data, but no ablation holds total training data volume fixed by adding an equivalent number of non-geometric synthetic examples. This control is required to substantiate that the geometric primitives supply a qualitatively different signal that disentangles visual grounding from semantic priors, rather than the gains arising from generic data-augmentation effects.
  2. [§4.3 (Ablations) and §5 (Conclusion)] The claim that spatial deficits 'stem from inadequate supervision signals rather than inherent architectural or resolution limitations' is not directly tested by experiments that vary model scale or input resolution while keeping the original training mixture fixed. Without such tests, the interpretation that PGT addresses the primary source of the observed deficits remains under-supported.
minor comments (2)
  1. [Abstract] The claim of 'extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks' is not accompanied by an explicit list of all evaluated benchmarks beyond What'sUp and CV-Bench-2D; adding this would improve transparency.
  2. [§3 (PGT Framework)] The procedural generation process for the geometric primitives and associated QA pairs could include pseudocode or additional implementation details to facilitate reproducibility.
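
The volume-matched control requested in major comment 1 could be set up along the following lines. This is a sketch under stated assumptions: `build_volume_matched_mixes` and the mix names are hypothetical, training examples are treated as opaque items, and the pool of non-geometric synthetic examples is assumed to exist already. It is not a description of any experiment in the paper.

```python
import random


def build_volume_matched_mixes(base, pgt, control_pool, seed=0):
    """Return two training mixes of identical size.

    base         -- the original instruction-tuning examples
    pgt          -- the procedurally generated geometric examples
    control_pool -- non-geometric synthetic examples (assumed available)
    """
    if len(control_pool) < len(pgt):
        raise ValueError("control pool is smaller than the PGT set")
    rng = random.Random(seed)
    # Draw exactly as many control examples as there are PGT examples,
    # so the two mixes differ in content but not in volume.
    control = rng.sample(list(control_pool), len(pgt))
    return {
        "base+pgt": list(base) + list(pgt),
        "base+control": list(base) + control,
    }


mixes = build_volume_matched_mixes(
    base=[f"base_{i}" for i in range(10)],
    pgt=[f"pgt_{i}" for i in range(4)],
    control_pool=[f"ctrl_{i}" for i in range(20)],
)
print({name: len(examples) for name, examples in mixes.items()})
```

Training one model per mix and comparing them on the same benchmarks would separate the effect of the geometric signal from the effect of simply adding more data.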

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, providing clarifications and indicating where revisions will be made.

  1. Referee: [§4 (Experiments), Table 2] The headline gains (+20% on What'sUp, +13.3% on CV-Bench-2D) compare LLaVA-v1.5-Instruct to the same model augmented with PGT data, but no ablation holds total training data volume fixed by adding an equivalent number of non-geometric synthetic examples. This control is required to substantiate that the geometric primitives supply a qualitatively different signal that disentangles visual grounding from semantic priors, rather than the gains arising from generic data-augmentation effects.

    Authors: We agree that an ablation holding total training data volume fixed with an equivalent number of non-geometric synthetic examples would strengthen the evidence that the geometric primitives provide a distinct signal. Our existing ablations in §4.3 compare PGT against other augmentation strategies and show that gains are specific to the geometric overlays. We will incorporate this additional control experiment in the revised manuscript to directly address the concern. revision: yes

  2. Referee: [§4.3 (Ablations) and §5 (Conclusion)] The claim that spatial deficits 'stem from inadequate supervision signals rather than inherent architectural or resolution limitations' is not directly tested by experiments that vary model scale or input resolution while keeping the original training mixture fixed. Without such tests, the interpretation that PGT addresses the primary source of the observed deficits remains under-supported.

    Authors: The experiments demonstrate consistent gains from PGT across fixed architectures and input resolutions, indicating that supervision can substantially mitigate the observed spatial deficits. We acknowledge that experiments varying model scale or resolution while holding the training mixture fixed would provide more direct support for the interpretation. We will revise the language in §5 (Conclusion) to frame the findings more precisely as evidence that inadequate supervision is a key addressable factor, rather than asserting it as the sole primary source. revision: partial

Circularity Check

0 steps flagged

Empirical augmentation study with no circular derivations or self-referential predictions

The paper is an empirical study that augments training data with procedurally generated geometric overlays and reports benchmark gains on external tasks (What'sUp, CV-Bench-2D). No equations, first-principles derivations, or predictions appear in the provided text. No self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are present. Results are measured against held-out benchmarks and therefore remain falsifiable outside the training mixture. This is the normal non-circular outcome for a data-augmentation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of procedurally generated geometric supervision; no new physical entities are postulated and the method uses standard supervised fine-tuning assumptions.

axioms (1)
  • domain assumption: Additional dense visual supervision from unambiguous primitives will improve grounding without harming general capabilities.
    Invoked when claiming that PGT-augmented training yields gains while maintaining perception capabilities.

pith-pipeline@v0.9.0 · 5767 in / 1246 out tokens · 31464 ms · 2026-05-25T04:38:42.291077+00:00 · methodology
