
arxiv: 2605.15672 · v1 · pith:UYHPPMKUnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

Pith reviewed 2026-05-20 20:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · line tracing · path following · local competition · visual reasoning · model failures · multimodal benchmarks

The pith

Vision-language models lose a target visual path and switch to nearby similar alternatives because local competition overrides the intended continuation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests vision-language models on controlled line-tracing tasks where a model must follow one selected path through successive local choices amid nearby competing lines. Even advanced models frequently abandon the target path for a locally similar distractor. The paper attributes this switching to local competition: nearby similar distractors pull the model away from the true continuation. Standard fixes such as larger model size, added reasoning steps, and explicit instructions provide only partial or costly relief, and the same failures appear in richer scenes like tangled cables and metro maps.

Core claim

The authors establish that state-of-the-art vision-language models lack robust line-tracing ability: they lose the selected path and switch to nearby alternatives especially when those alternatives match locally, and this pattern arises from local competition that persists across model scales, prompting strategies, and more complex visual settings.

What carries the argument

Local competition from nearby similar distractors that diverts attention away from the true path continuation in successive visual decisions.

If this is right

  • Increasing model size yields only limited gains in stable path following.
  • Reasoning partially compensates, but through costly substitute strategies rather than direct tracing.
  • Explicit instructions to trace the path do not produce reliable long-term adherence.
  • The same path-switching errors occur in visually richer scenes such as tangled cables and metro maps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar local competition may affect other sequential visual tasks that require maintaining a global structure over many steps.
  • Applications needing precise visual navigation, such as diagram reading or route following, could inherit this brittleness.
  • Architectures that explicitly track a chosen path across steps might reduce reliance on purely local decisions.
  • Attention or activation analysis during these failures could point to targeted fixes without full retraining.

Load-bearing premise

The controlled tracing tasks isolate pure line-following ability by adding nearby competitors without introducing extra semantic or topological confusion.

What would settle it

Consistent maintenance of the original path by current VLMs across multiple trials with added locally similar alternatives would contradict the reported failure pattern.

Figures

Figures reproduced from arXiv: 2605.15672 by Albert No, Dongjae Jeon, Hyesoo Hong, Minsoo Kim, Sangyeon Yoon, Wonje Jeung.

Figure 1: Two tracing settings and the corresponding model performance.
Figure 2: Adjacent-turn jumps dominate Swirl errors. Counts of adjacent-turn jumps and other errors across models and rotation-density levels on the Swirl task. While the circuit task provides a realistic multi-wire tracing setting, it still contains ports, component labels, and circuit-specific layout structure. To eliminate these as well, we design Swirl, an abstract dataset reduced to the tracing requirement al…
Figure 3: Controlled similarity conditions and causal masking setup. (a)
Figure 4: Internal selectivity under increasing shared local structure.
Figure 5: Attention comparison setup. Attention from Red on queried path to the correct next dot Green versus the competing distractor Dist-Red. For attention preference, we compute $\Delta^{(\ell)}_{\text{attn}} = a^{(\ell)}_{\text{Red}\to\text{Green}} - a^{(\ell)}_{\text{Red}\to\text{Dist-Red}}$, where $\ell$ indexes the vision-encoder block, $a^{(\ell)}_{\text{Red}\to\text{Green}}$ is the attention from the queried Red dot to the true next dot Green, and $a^{(\ell)}_{\text{Red}\to\text{Dist-Red}}$ is the maximum attention from the…
Figure 6: Qwen3-VL model-size scaling on Swirl task. Accuracy across rotation-density levels for models from 2B to 235B parameters. We examine training-time scaling by comparing models within the Qwen3-VL family, spanning 2B to 235B parameters. This within-family comparison tests whether increased model scale translates into more reliable visual tracing
Figure 7: Reasoning improves performance, but does not recover genuine tracing. (a)
Figure 8: Instructing models to trace does not recover performance.
Figure 9: Realistic path-tracing examples. Metro maps (left) and tangled HANDLOOM cables (right), where the selected path must be traced through crossings, overlaps, and nearby competing structures. In previous sections, we stress-tested diverse VLMs and identified adjacent jumps, a failure in which the model abruptly switches from the queried path to a nearby alternative. This failure persisted across our controlle…
Figure 10: Examples from the modified Circuit Connections task.
Figure 11: InternVL3.5-8B internal preference under nearby distractors.
Figure 12: Keyword-based non-tracing pattern comparison.
Figure 13: Progressive sequence breakdown on HANDLOOM cable tracing. Prefix accuracy on 24 cable images, plotted over the first 15 of 30 colored-dot steps. We evaluate reasoning-enabled models on tangled cable images from HANDLOOM [35]. In this setting, the target cable must be traced through scenes containing self-overlap, crossings, and nearby cable segments. We place 30 colored dots along each target cable, an…
Figure 14: Average attention to true next dots and nearby cable competitors.
Figure 15: Cropped HANDLOOM cable attention examples.
Figure 16: Examples from the metro-map set. The set contains (a) cropped regions from real transit maps and (b) synthetic images rendered in a metro-map style. Crops are selected to preserve local route ambiguity while avoiding regions dominated by extreme clutter or unreadable text. I.2 Metro Maps Following the HANDLOOM cable probe, we also evaluate reasoning-enabled models in a visually richer practical diagrammat…
Figure 17: Reasoning traces on a Los Angeles metro-map crop.
Figure 18: Reasoning traces on a synthetic metro-map image.
Figure 19: Reasoning traces on a synthetic metro-map image.
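
Two quantities named in the captions above, the attention preference of Figure 5 and the prefix accuracy of Figure 13, can be sketched in a few lines of Python. This is a minimal illustration and not the paper's code. It assumes attention from the queried dot is available per encoder block as a mapping from token label to weight, that the distractor term is a maximum over distractor entries (as the truncated caption suggests), and that prefix accuracy at step k means the fraction of sequences whose first k predicted dots all match the ground truth. The function names and toy values are hypothetical.

```python
from typing import Dict, List, Sequence


def attention_preference(
    layer_attn: Sequence[Dict[str, float]],
    target: str,
    distractors: Sequence[str],
) -> List[float]:
    """Per block: attention to the target minus the max attention to any distractor."""
    deltas = []
    for attn in layer_attn:
        a_target = attn.get(target, 0.0)
        a_distractor = max(attn.get(name, 0.0) for name in distractors)
        deltas.append(a_target - a_distractor)
    return deltas


def prefix_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[Sequence[str]],
    max_steps: int,
) -> List[float]:
    """Fraction of sequences whose first k predictions all match, for k = 1..max_steps."""
    if not targets:
        return [0.0] * max_steps
    curve = []
    for k in range(1, max_steps + 1):
        correct = sum(
            1
            for pred, gold in zip(predictions, targets)
            if len(gold) >= k and list(pred[:k]) == list(gold[:k])
        )
        curve.append(correct / len(targets))
    return curve


# Hypothetical attention weights from the queried Red dot, one dict per encoder block.
toy_layers = [
    {"Green": 0.30, "Dist-Red-1": 0.10, "Dist-Red-2": 0.05},
    {"Green": 0.10, "Dist-Red-1": 0.25, "Dist-Red-2": 0.40},
]
deltas = attention_preference(toy_layers, "Green", ["Dist-Red-1", "Dist-Red-2"])

# Hypothetical dot sequences: the second prediction switches paths at step 2.
gold = [["red", "blue", "green"], ["red", "green", "blue"]]
pred = [["red", "blue", "green"], ["red", "blue", "blue"]]
curve = prefix_accuracy(pred, gold, max_steps=3)
```

A positive entry in `deltas` means the block attends more to the true next dot than to its strongest competitor. Under this definition of prefix accuracy, a single path switch caps the curve for every later step, which is why it suits a task where one wrong continuation derails the rest of the trace.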
Original abstract

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that vision-language models (VLMs) fail at line-tracing tasks by losing the target path and switching to locally similar alternatives. It introduces controlled tracing tasks that add nearby competitors while minimizing crossings, overlaps, and semantic ambiguity. Behavioral interventions and internal analyses are presented as evidence that these switches stem from local competition. The work further reports that model scaling, chain-of-thought reasoning, and explicit tracing instructions provide only partial or costly mitigation, and that the same failure mode appears in more complex scenes such as tangled cables and metro maps.

Significance. If the causal attribution to local competition holds, the result identifies a concrete, architecture-level limitation in current VLMs that is not resolved by scale or prompting. The controlled-task design supplies a reproducible probe for visual continuity that could be adopted by others studying multimodal robustness.

major comments (2)
  1. [Abstract] Abstract (behavioral interventions and internal analyses): the attribution of path-switching failures specifically to local competition is the central claim, yet the supporting analyses are described only at a high level. Without targeted interventions (e.g., attention-score editing on distractors) or explicit controls that separate local competition from low-level factors such as patch-based encoding limits or prompt-induced global context loss, the evidence remains consistent with multiple mechanisms and does not yet isolate the proposed cause.
  2. [Abstract] Controlled tracing tasks (Abstract): the claim that these tasks isolate line-tracing ability rests on the assumption that nearby competitors are the dominant variable once semantic and topological ambiguities are reduced. No quantitative metrics of local similarity or ablation of other perceptual confounds are referenced, which leaves open the possibility that observed switches reflect broader visual encoding weaknesses rather than competition per se.
minor comments (2)
  1. The manuscript would be strengthened by reporting sample sizes, number of trials per condition, and any statistical tests used to support the behavioral and internal-analysis results.
  2. Clarify whether the internal analyses consist of attention visualizations, activation correlations, or other techniques, and indicate where the corresponding figures or methods appear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where our evidence for local competition could be strengthened with additional controls and metrics. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] Abstract (behavioral interventions and internal analyses): the attribution of path-switching failures specifically to local competition is the central claim, yet the supporting analyses are described only at a high level. Without targeted interventions (e.g., attention-score editing on distractors) or explicit controls that separate local competition from low-level factors such as patch-based encoding limits or prompt-induced global context loss, the evidence remains consistent with multiple mechanisms and does not yet isolate the proposed cause.

    Authors: We agree that stronger causal isolation would improve the central claim. The original behavioral interventions systematically varied distractor similarity while fixing global scene properties and prompt structure, and the internal analyses compared attention maps and activation similarities between target and competitor paths. These results are consistent with local competition but, as noted, remain partly correlational. In the revised manuscript we have added explicit attention-score editing experiments (zeroing attention to distractor patches in selected layers) that measurably reduce switch rates, together with ablations that vary patch size and prompt context length. The new results appear in Section 4.3 and Appendix B and help separate local competition from the low-level confounds mentioned. revision: yes

  2. Referee: [Abstract] Controlled tracing tasks (Abstract): the claim that these tasks isolate line-tracing ability rests on the assumption that nearby competitors are the dominant variable once semantic and topological ambiguities are reduced. No quantitative metrics of local similarity or ablation of other perceptual confounds are referenced, which leaves open the possibility that observed switches reflect broader visual encoding weaknesses rather than competition per se.

    Authors: We acknowledge that quantitative grounding for the dominance of local competitors strengthens the task design. The original tasks minimized crossings and semantic overlap by construction, but similarity was controlled qualitatively. In the revision we introduce a local similarity metric (mean cosine similarity of vision-encoder patch embeddings within a sliding window around each path segment) and report its correlation with switch rates. We further add ablation conditions that introduce non-competitor confounds (additive noise, contrast reduction) while removing nearby lines; switch rates remain substantially lower than in the competitor-present conditions. These metrics and ablations are now reported in Section 3.2 and Appendix A. revision: yes
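
The two additions described in the responses above, attention-score editing and a local similarity metric, can be pictured with a small sketch. It is illustrative only and rests on assumptions the text does not state: attention is edited after the softmax and the remaining weights are renormalized, and the patch embeddings that fall inside the sliding window around each path segment have already been collected. The helper names `zero_distractor_attention` and `local_similarity` are hypothetical.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def zero_distractor_attention(
    attn_row: Sequence[float], distractor_idx: Set[int], renormalize: bool = True
) -> List[float]:
    """Zero the weights on distractor patches; optionally rescale the rest to sum to 1."""
    edited = [0.0 if i in distractor_idx else w for i, w in enumerate(attn_row)]
    if renormalize:
        total = sum(edited)
        if total > 0:
            edited = [w / total for w in edited]
    return edited


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def local_similarity(
    path_embs: Sequence[Vector], window_embs: Sequence[Sequence[Vector]]
) -> float:
    """Mean cosine similarity between each path patch and the patches in its window."""
    sims = [
        cosine(p, n)
        for p, neighbors in zip(path_embs, window_embs)
        for n in neighbors
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical attention row over four patches; patch 1 is a distractor.
edited = zero_distractor_attention([0.1, 0.4, 0.3, 0.2], {1})

# Hypothetical 2-D embeddings: two path patches and the patches in their windows.
score = local_similarity(
    [[1.0, 0.0], [0.0, 1.0]],
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0]]],
)
```

Renormalizing keeps each edited row a valid distribution, so any drop in switch rate cannot be blamed on reduced total attention mass. Passing `renormalize=False` gives the plain zeroing variant for comparison.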

Circularity Check

0 steps flagged

No circularity in empirical diagnosis of VLM tracing failures

full rationale

The paper is an empirical diagnostic study that designs controlled line-tracing tasks, measures VLM performance on them, applies behavioral interventions, and reports internal analyses. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the abstract or described content. Claims rest on direct task outcomes and interventions that are independently verifiable against external model evaluations rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study of model behavior on visual tasks and does not rely on mathematical derivations or new theoretical postulates.

axioms (1)
  • domain assumption: VLMs can be prompted and evaluated on image-based continuation tasks.
    Standard assumption underlying all VLM behavioral testing.

pith-pipeline@v0.9.0 · 5719 in / 1104 out tokens · 63109 ms · 2026-05-20T20:08:50.499555+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

  1. [1]

    Introducing claude sonnet 4.5

    Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5, 2025. Accessed: 2026-04-07

  2. [2]

    Visual symbolic mechanisms: Emergent symbol processing in vision language models

    Rim Assouel, Declan Campbell, Yoshua Bengio, and Taylor Webb. Visual symbolic mechanisms: Emergent symbol processing in vision language models. In ICLR, 2026

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  4. [4]

    Vlms have tunnel vision: Evaluating nonlocal visual reasoning in leading vlms

    Shmuel Berman and Jia Deng. Vlms have tunnel vision: Evaluating nonlocal visual reasoning in leading vlms. In NeurIPS, 2025

  5. [5]

    Understanding the limits of vision language models through the lens of the binding problem

    Declan Campbell, Sunayana Rane, Tyler Giallanza, Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven M Frankland, Thomas L Griffiths, Jonathan D Cohen, et al. Understanding the limits of vision language models through the lens of the binding problem. In NeurIPS, 2024

  6. [6]

    Response wide shut? surprising observations in basic vision language model capabilities

    Shivam Chandhok, Wan-Cyuan Fan, Vered Shwartz, Vineeth N Balasubramanian, and Leonid Sigal. Response wide shut? surprising observations in basic vision language model capabilities. In ACL, 2025

  7. [7]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024

  8. [8]

    Babyvision: Visual reasoning beyond language

    Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language. arXiv preprint arXiv:2601.06521, 2026

  9. [9]

    Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas

    Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas. In ICML, 2025

  10. [10]

    Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning

    Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning. In ICLR, 2026

  11. [11]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In ECCV, 2024

  12. [12]

    Gemini 3 flash

    Google. Gemini 3 flash. https://deepmind.google/models/gemini/flash/, 2025. Accessed: 2026-04-27

  13. [13]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024

  14. [14]

    Do vision-language models really understand visual language?

    Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual language? In ICML, 2025

  15. [15]

    A gradual spread of attention during mental curve tracing

    Rinus Houtkamp, Henk Spekreijse, and Pieter R. Roelfsema. A gradual spread of attention during mental curve tracing. Perception & Psychophysics, 2003

  16. [16]

    Why vision language models struggle with visual arithmetic? towards enhanced chart and geometry understanding

    Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. Why vision language models struggle with visual arithmetic? towards enhanced chart and geometry understanding. In Findings of ACL, 2025

  17. [17]

    Curve tracing: A possible basic operation in the perception of spatial relations

    Pierre Jolicoeur, Shimon Ullman, and Marilynn Mackay. Curve tracing: A possible basic operation in the perception of spatial relations. Memory & Cognition, 1986

  18. [18]

    What’s “up” with vision-language models? investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In EMNLP, 2023

  19. [19]

    Visonlyqa: Large vision language models still struggle with visual perception of geometric information

    Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, and Rui Zhang. Visonlyqa: Large vision language models still struggle with visual perception of geometric information. In COLM, 2025

  20. [20]

    Vlind-bench: Measuring language priors in large vision-language models

    Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, and Kyomin Jung. Vlind-bench: Measuring language priors in large vision-language models. In Findings of NAACL, 2025

  21. [21]

    Visiongraph: Leveraging large multimodal models for graph theory problems in visual context

    Yunxin Li, Baotian Hu, Haoyuan Shi, Wei Wang, Longyue Wang, and Min Zhang. Visiongraph: Leveraging large multimodal models for graph theory problems in visual context. In ICML, 2024

  22. [22]

    Learning long-range spatial dependencies with horizontal gated recurrent units

    Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. In NeurIPS, 2018

  23. [23]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024

  24. [24]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024

  25. [25]

    Probing visual language priors in vlms

    Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, and Honglak Lee. Probing visual language priors in vlms. In ICML, 2025

  26. [26]

    Unraveling the truth: Do vlms really understand charts? a deep dive into consistency and robustness

    Srija Mukhopadhyay, Adnan Qidwai, Aparna Garimella, Pritika Ramu, Vivek Gupta, and Dan Roth. Unraveling the truth: Do vlms really understand charts? a deep dive into consistency and robustness. In Findings of EMNLP, 2024

  27. [27]

    Introducing gpt-5.4

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/. Accessed: 2026-04-07

  28. [28]

    TraversalBench: Challenging Paths to Follow for Vision Language Models

    Clara Petrova, Zhuo Chen, and Marin Soljačić. Traversalbench: Challenging paths to follow for vision language models. arXiv preprint arXiv:2604.10999, 2026

  29. [29]

    Mental curve tracing with elementary stimuli

    Robert Pringle and Howard E. Egeth. Mental curve tracing with elementary stimuli. Journal of Experimental Psychology: Human Perception and Performance, 1988

  30. [30]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In ACCV, 2024

  31. [31]

    Cortical algorithms for perceptual grouping

    Pieter R Roelfsema. Cortical algorithms for perceptual grouping. Annual Review of Neuroscience, 2006

  32. [32]

    Incremental grouping of image elements in vision

    Pieter R Roelfsema and Roos Houtkamp. Incremental grouping of image elements in vision. Attention, Perception, & Psychophysics, 2011

  33. [33]

    The spatial profile of visual attention in mental curve tracing

    H. Steven Scholte, Henk Spekreijse, and Pieter R. Roelfsema. The spatial profile of visual attention in mental curve tracing. Vision Research, 2001

  34. [34]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024

  35. [35]

    Handloom: Learned tracing of one-dimensional objects for inspection and manipulation

    Vainavi Viswanath, Kaushik Shivakumar, Mallika Parulekar, Jainil Ajmera, Justin Kerr, Jeffrey Ichnowski, Richard Cheng, Thomas Kollar, and Ken Goldberg. Handloom: Learned tracing of one-dimensional objects for inspection and manipulation. In CoRL, 2023

  36. [36]

    Vision language models are biased

    An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased. In ICLR, 2026

  37. [37]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou,...

  38. [38]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  39. [39]

    Do mllms really understand the charts?

    Xiao Zhang, Dongyuan Li, Liuyu Xiang, Yao Zhang, Cheng Zhong, and Zhaofeng He. Do mllms really understand the charts? arXiv preprint arXiv:2509.04457, 2025

  40. [40]

    Are they the same? exploring visual correspondence shortcomings of multimodal llms

    Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Lu Qi, and Xiangtai Li. Are they the same? exploring visual correspondence shortcomings of multimodal llms. In ICCV, 2025

A Limitations

Our study focuses on a specific visual primitive: preserving the identity of a selected path under nearby local co...

Tracing instruction text extracted from the paper:

  1. Start from the point labeled “S” and treat it as the current position
  2. Select the dot that is visibly connected to the current position by a continuous line; use this line connection as the sole criterion for selection
  3. Do not select a dot based on proximity or visual similarity to the path. For example, do not select a dot simply because it is close to the current position, or because it lies on a nearby parallel segment
  4. Move to the selected dot, treat it as the new current position, and repeat until the path ends. Example: Suppose the current position is “S”, and you see a red dot connected to “S” by a continuous line, and a blue dot nearby but on a different path. Even if blue appears closer or lies in the same direction, you select red because it is connected by the li...
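
The instruction steps above describe a simple loop: start at “S”, pick the dot joined to the current position by a continuous line, move there, and repeat until the path ends. A minimal sketch of that loop follows, assuming the image has already been reduced to a hypothetical `next_dot` mapping from each dot to the one it is connected to. It illustrates the procedure as a ground-truth tracer and is not the paper's evaluation code.

```python
from typing import Dict, List, Optional


def trace_path(next_dot: Dict[str, Optional[str]], start: str = "S") -> List[str]:
    """Follow line connections from `start` until the path ends or would revisit a dot."""
    path = [start]
    seen = {start}
    current = start
    while True:
        nxt = next_dot.get(current)
        # Stop at the end of the path; the `seen` check guards against cycles.
        if nxt is None or nxt in seen:
            break
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path


# Hypothetical connections: "blue" lies on a different path and is never selected,
# no matter how close it is drawn to the target path.
connections = {"S": "red", "red": "green", "green": "E", "blue": "yellow"}
traced = trace_path(connections)
```

Because the tracer consults connectivity alone, proximity and visual similarity cannot affect its choice, which is the sole criterion the instructions ask the model to apply.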