PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Adriana Romero-Soriano; Amir Bar; Michal Drozdzal; Rim Assouel

arxiv: 2605.23883 · v1 · pith:W75NYE45new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-25 04:38 UTC · model grok-4.3

keywords multimodal large language models · visual grounding · spatial reasoning · procedural generation · instruction tuning · fine-grained perception · geometric primitives

The pith

Overlaying unambiguous geometric primitives on images supplies dense supervision that improves visual grounding in MLLMs and produces gains of up to 20% on spatial benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Procedurally Generated Tasks to create additional training data by placing clear geometric shapes and lines on existing images. This produces supervision signals that separate precise visual attention from language-based priors. When added to standard instruction-tuning sets, the data lifts results on relational and depth benchmarks while leaving general perception performance unchanged. The central argument is that many observed shortfalls in fine-grained spatial understanding trace to insufficient training signals rather than limits in model size or image resolution.

Core claim

PGT generates new training examples by overlaying unambiguous geometric primitives on images, creating dense supervision that disentangles visual grounding capability from semantic priors. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data yields improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. Finetuning state-of-the-art MLLMs on PGT data produces additional boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D, with gains observed across diverse architectures on relational, quantitative, and 3D understanding tasks.

What carries the argument

Procedurally Generated Tasks (PGT), a data generation framework that overlays unambiguous geometric primitives on images to produce dense visual supervision signals.
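
As an editorial illustration only, and not the authors' code, the overlay idea can be sketched as follows. The sketch samples two primitives, records where they would be drawn, and derives the answer purely from their coordinates, so the label cannot be guessed from what the underlying photo depicts. The names `Primitive`, `sample_pair`, and `make_spatial_qa`, the shape and color vocabulary, the sizes, and the question template are all assumptions; rendering the overlays onto pixels is left out.

```python
import random
from dataclasses import dataclass

# Hypothetical vocabulary; the paper's actual shapes, colors, and prompts may differ.
SHAPES = ("circle", "square", "triangle")
COLORS = ("red", "green", "blue", "yellow")


@dataclass(frozen=True)
class Primitive:
    shape: str
    color: str
    cx: int    # center x in pixels
    cy: int    # center y in pixels (y grows downward, as in image coordinates)
    half: int  # half the side of the primitive's bounding square

    @property
    def name(self) -> str:
        return f"{self.color} {self.shape}"


def sample_pair(width, height, rng, half=24, min_gap=16):
    """Sample two distinct primitives whose bounding boxes are separated horizontally."""
    if width < 4 * half + min_gap or height < 2 * half:
        raise ValueError("image too small for two separated primitives")
    # Two different (shape, color) kinds so each primitive has a unique name.
    kinds = rng.sample([(s, c) for s in SHAPES for c in COLORS], 2)
    while True:
        a, b = (
            Primitive(s, c, rng.randint(half, width - half),
                      rng.randint(half, height - half), half)
            for s, c in kinds
        )
        # Reject layouts where left/right would be ambiguous.
        if abs(a.cx - b.cx) >= 2 * half + min_gap:
            return a, b


def make_spatial_qa(a, b):
    """Build a question whose answer depends only on the overlay coordinates."""
    answer = "left" if a.cx < b.cx else "right"
    question = f"Is the {a.name} to the left or to the right of the {b.name}?"
    return {"question": question, "answer": answer, "overlays": [a, b]}
```

Calling `make_spatial_qa(*sample_pair(336, 336, random.Random(0)))` yields one question, its answer, and the two overlay specifications; the 336-pixel image size is likewise only an example value.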

If this is right

Instruction tuning on PGT-augmented data improves performance on relational, quantitative, and 3D/depth understanding benchmarks.
The gains occur while general perception capabilities remain intact.
Finetuning already strong MLLMs on PGT data produces further measurable lifts on the same spatial tasks.
PGT functions as both an improvement method and a diagnostic tool for locating perception failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the supervision signal is the main bottleneck, then similar procedural overlays could be applied to other perception tasks such as object counting or attribute binding.
The approach implies that scaling the volume of geometrically annotated data might continue to raise spatial performance even for larger models.
PGT-style diagnostics could be used to test whether other MLLM deficits, such as hallucination on visual attributes, also reduce when unambiguous visual cues are added during training.

Load-bearing premise

Overlaying unambiguous geometric primitives supplies supervision that truly disentangles visual grounding from semantic priors and that inadequate supervision is the primary source of spatial reasoning deficits rather than model capacity or resolution limits.

What would settle it

Training an MLLM on PGT-augmented data and then measuring zero improvement on the What'sUp benchmark when the geometric overlays are removed from test images would show that the claimed disentanglement and generalization do not hold.
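
Judging whether an improvement is effectively zero needs some estimate of noise. A minimal sketch, assuming per-example correctness flags (0 or 1) are available for a baseline and a PGT-tuned model on the same test items, is a paired bootstrap over items. The function names, the resample count, and the 95% interval are assumptions made for illustration, not the paper's protocol.

```python
import random


def accuracy(correct):
    """Fraction of items answered correctly; `correct` holds 0/1 flags."""
    return sum(correct) / len(correct)


def paired_bootstrap_delta(base, tuned, n_resamples=2000, alpha=0.05, seed=0):
    """Return (observed accuracy gain, lower bound, upper bound).

    `base` and `tuned` must be aligned: position i refers to the same test item.
    The bounds are percentile bootstrap limits on the paired accuracy difference.
    """
    if len(base) != len(tuned) or not base:
        raise ValueError("inputs must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(base)
    observed = accuracy(tuned) - accuracy(base)
    deltas = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(tuned[i] - base[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

If the interval returned for a benchmark comfortably contains zero, the data would be consistent with the "no improvement" outcome described above; an interval well clear of zero would argue against it.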

Figures

Figures reproduced from arXiv: 2605.23883 by Adriana Romero-Soriano, Amir Bar, Michal Drozdzal, Rim Assouel.

**Figure 1.** Overview of PGT. Top: The construction of our procedurally generated data to augment instruction tuning training datasets. Abstract geometric primitives are overlaid to training data, when available. Bottom: (Left) Examples of failure modes in fine-grained relational and spatial understanding of state-of-the-art MLLMs. In the first example the model can rely on the fact that a bowl is usually on a table an…

**Figure 2.** Our suite of PGT: (left) spatial relationship reasoning, (center) abstract counting, and (right) 2D relative distance estimation. In this section, we provide the specific prompts and templates used for generating the Procedurally Generated Tasks (PGT) as well as the prompts used for the Specialized Mix (constructed from TallyQA, VSR, and Spatial Ladder). A.1. Handling Occlusion and Semantic Preservation To…
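
The counting and relative-distance tasks named in the Figure 2 caption could be generated along these lines. This is a hedged sketch under our own assumptions, not the paper's templates: the function names, the target and distractor kinds, the point labels `anchor`, `A`, and `B`, and the minimum distance margin are all hypothetical, and overlap between counted primitives is not handled.

```python
import math
import random


def make_counting_qa(width, height, rng, max_count=6,
                     target=("red", "circle"), distractor=("blue", "square")):
    """Place a random number of target and distractor primitives; ask for the target count."""
    n_target = rng.randint(1, max_count)
    n_distractor = rng.randint(0, max_count)

    def place(kind, n):
        color, shape = kind
        return [{"color": color, "shape": shape,
                 "x": rng.randrange(width), "y": rng.randrange(height)}
                for _ in range(n)]

    overlays = place(target, n_target) + place(distractor, n_distractor)
    rng.shuffle(overlays)  # drawing order should not reveal the count
    question = f"How many {target[0]} {target[1]}s are in the image?"
    return {"question": question, "answer": str(n_target), "overlays": overlays}


def make_distance_qa(width, height, rng, min_margin=20.0, max_tries=1000):
    """Ask which of two labeled points is closer to an anchor point (2D pixel distance)."""
    for _ in range(max_tries):
        pts = {k: (rng.randrange(width), rng.randrange(height))
               for k in ("anchor", "A", "B")}
        d_a = math.dist(pts["anchor"], pts["A"])
        d_b = math.dist(pts["anchor"], pts["B"])
        # Keep only layouts where one point is clearly closer than the other.
        if abs(d_a - d_b) >= min_margin:
            answer = "A" if d_a < d_b else "B"
            question = "Which point is closer to the anchor, A or B?"
            return {"question": question, "answer": answer, "points": pts}
    raise RuntimeError("could not sample an unambiguous layout")
```

In both functions the answer is computed from the sampled geometry alone, which is the property the pith attributes to PGT supervision.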

Original abstract

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

PGT adds geometric overlays to training data and reports solid benchmark lifts on spatial tasks, but the gains could be from extra data volume rather than the geometric signal itself.

The core result is that instruction-tuning LLaVA-v1.5 on its original mix plus PGT data lifts What'sUp by up to 20% and CV-Bench-2D by 13.3%, with smaller but still positive gains when fine-tuning stronger models. The method itself is straightforward: overlay simple geometric primitives on images to create dense, unambiguous supervision that the authors argue separates visual grounding from language priors. That framing is new enough in the MLLM literature and the dual use as both training signal and diagnostic is a useful angle. The paper also checks that general perception does not degrade, which is a practical plus.

The main weakness is the missing controls the stress-test flagged. There is no ablation that holds total training tokens or examples fixed, and no direct comparison to non-geometric synthetic captions of similar volume. Without those, the improvements are compatible with generic data-augmentation effects rather than proof that inadequate supervision is the main bottleneck. The abstract gives no statistical details or exclusion rules either, so robustness is hard to judge from what is shown.

This work is aimed at groups already fine-tuning MLLMs for robotics or visual reasoning who want a cheap way to target spatial failures. It is worth sending to peer review so referees can check the ablations and data-matching details; the idea is simple and the reported deltas are large enough to matter if they hold up.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Procedurally Generated Tasks (PGT), a framework that overlays unambiguous geometric primitives on images to create additional dense supervision for fine-grained visual understanding in MLLMs. PGT is presented as both an augmentation method for instruction tuning and a diagnostic tool. The central empirical claim is that augmenting LLaVA-v1.5-Instruct with PGT data yields gains of up to +20% on What'sUp and +13.3% on CV-Bench-2D, with smaller gains (+5.5% and +8.3%) when finetuning SOTA models, while preserving general perception capabilities. The authors conclude that many spatial reasoning deficits arise from inadequate supervision rather than architectural or resolution limitations.

Significance. If the central interpretation holds after appropriate controls, the work would be significant for multimodal learning research. It offers a low-cost, procedurally generated data augmentation strategy that targets visual grounding without altering model architecture or resolution. The dual diagnostic role could help isolate perception failures. Demonstrating gains across architectures and no degradation in general capabilities strengthens the practical value. The emphasis on supervision as the primary bottleneck could redirect attention toward targeted data curation for spatial tasks in MLLMs.

major comments (2)

[§4 (Experiments), Table 2] The headline gains (+20% on What'sUp, +13.3% on CV-Bench-2D) compare LLaVA-v1.5-Instruct to the same model augmented with PGT data, but no ablation holds total training data volume fixed by adding an equivalent number of non-geometric synthetic examples. This control is required to substantiate that the geometric primitives supply a qualitatively different signal that disentangles visual grounding from semantic priors, rather than the gains arising from generic data-augmentation effects.
[§4.3 (Ablations) and §5 (Conclusion)] The claim that spatial deficits 'stem from inadequate supervision signals rather than inherent architectural or resolution limitations' is not directly tested by experiments that vary model scale or input resolution while keeping the original training mixture fixed. Without such tests, the interpretation that PGT addresses the primary source of the observed deficits remains under-supported.
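
The volume-matched control requested in the first major comment could be set up roughly as below. This is a sketch under stated assumptions rather than a description of the authors' pipeline: `build_mixtures` and the arm names are hypothetical, examples are treated as opaque items, and volume is matched by example count (matching by token count would be a stricter variant).

```python
import random


def build_mixtures(base, pgt, control, n_extra, seed=0):
    """Build three training arms that differ only in the kind of added data.

    base:    the original instruction-tuning examples
    pgt:     pool of geometric-overlay examples
    control: pool of non-geometric synthetic examples
    n_extra: number of examples added to each augmented arm
    """
    if n_extra > min(len(pgt), len(control)):
        raise ValueError("n_extra exceeds the size of an augmentation pool")
    rng = random.Random(seed)
    pgt_extra = rng.sample(pgt, n_extra)
    control_extra = rng.sample(control, n_extra)
    return {
        "base_only": list(base),
        # Both augmented arms have exactly len(base) + n_extra examples.
        "base_plus_pgt": list(base) + pgt_extra,
        "base_plus_control": list(base) + control_extra,
    }
```

Comparing `base_plus_pgt` against `base_plus_control`, rather than against `base_only`, is what would separate the effect of the geometric signal from the effect of simply training on more data.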

minor comments (2)

[Abstract] The claim of 'extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks' is not accompanied by an explicit list of all evaluated benchmarks beyond What'sUp and CV-Bench-2D; adding this would improve transparency.
[§3 (PGT Framework)] The procedural generation process for the geometric primitives and associated QA pairs could include pseudocode or additional implementation details to facilitate reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, providing clarifications and indicating where revisions will be made.

Referee: [§4 (Experiments), Table 2] The headline gains (+20% on What'sUp, +13.3% on CV-Bench-2D) compare LLaVA-v1.5-Instruct to the same model augmented with PGT data, but no ablation holds total training data volume fixed by adding an equivalent number of non-geometric synthetic examples. This control is required to substantiate that the geometric primitives supply a qualitatively different signal that disentangles visual grounding from semantic priors, rather than the gains arising from generic data-augmentation effects.

Authors: We agree that an ablation holding total training data volume fixed with an equivalent number of non-geometric synthetic examples would strengthen the evidence that the geometric primitives provide a distinct signal. Our existing ablations in §4.3 compare PGT against other augmentation strategies and show that gains are specific to the geometric overlays. We will incorporate this additional control experiment in the revised manuscript to directly address the concern.
Revision: yes

Referee: [§4.3 (Ablations) and §5 (Conclusion)] The claim that spatial deficits 'stem from inadequate supervision signals rather than inherent architectural or resolution limitations' is not directly tested by experiments that vary model scale or input resolution while keeping the original training mixture fixed. Without such tests, the interpretation that PGT addresses the primary source of the observed deficits remains under-supported.

Authors: The experiments demonstrate consistent gains from PGT across fixed architectures and input resolutions, indicating that supervision can substantially mitigate the observed spatial deficits. We acknowledge that experiments varying model scale or resolution while holding the training mixture fixed would provide more direct support for the interpretation. We will revise the language in §5 and the conclusion to more precisely frame the findings as evidence that inadequate supervision is a key addressable factor, rather than asserting it as the sole primary source.
Revision: partial

Circularity Check

0 steps flagged

Empirical augmentation study with no circular derivations or self-referential predictions

The paper is an empirical study that augments training data with procedurally generated geometric overlays and reports benchmark gains on external tasks (What'sUp, CV-Bench-2D). No equations, first-principles derivations, or predictions appear in the provided text. No self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are present. Results are measured against held-out benchmarks and therefore remain falsifiable outside the training mixture. This is the normal non-circular outcome for a data-augmentation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of procedurally generated geometric supervision; no new physical entities are postulated and the method uses standard supervised fine-tuning assumptions.

axioms (1)

Domain assumption: Additional dense visual supervision from unambiguous primitives will improve grounding without harming general capabilities.
Invoked when claiming that PGT-augmented training yields gains while maintaining perception capabilities.

