VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following
Pith reviewed 2026-05-20 20:08 UTC · model grok-4.3
The pith
Vision-language models lose a target visual path and switch to nearby similar alternatives because local competition overrides the intended continuation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that state-of-the-art vision-language models lack robust line-tracing ability: they lose the selected path and switch to nearby alternatives especially when those alternatives match locally, and this pattern arises from local competition that persists across model scales, prompting strategies, and more complex visual settings.
What carries the argument
Local competition from nearby similar distractors that diverts attention away from the true path continuation in successive visual decisions.
If this is right
- Increasing model size yields only limited gains in stable path following.
- Reasoning only partially compensates, and does so through costly substitute strategies rather than direct tracing.
- Explicit instructions to trace the path do not produce reliable long-term adherence.
- The same path-switching errors occur in visually richer scenes such as tangled cables and metro maps.
Where Pith is reading between the lines
- Similar local competition may affect other sequential visual tasks that require maintaining a global structure over many steps.
- Applications needing precise visual navigation, such as diagram reading or route following, could inherit this brittleness.
- Architectures that explicitly track a chosen path across steps might reduce reliance on purely local decisions.
- Attention or activation analysis during these failures could point to targeted fixes without full retraining.
Load-bearing premise
The controlled tracing tasks isolate pure line-following ability by adding nearby competitors without introducing extra semantic or topological confusion.
What would settle it
Consistent maintenance of the original path by current VLMs across multiple trials with added locally similar alternatives would contradict the reported failure pattern.
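To make this test concrete, a switch rate could be computed over repeated trials. The sketch below is illustrative only: the function names `first_divergence` and `switch_rate` and the list-of-labels encoding of a path are assumptions introduced here, not the paper's evaluation code, and the two trials are toy data rather than reported results.

```python
def first_divergence(target, traced):
    """Index of the first step where the traced path leaves the target, or None."""
    for i, (t, m) in enumerate(zip(target, traced)):
        if t != m:
            return i
    # Paths agree on the shared prefix; a length mismatch also counts as a departure.
    return None if len(target) == len(traced) else min(len(target), len(traced))


def switch_rate(trials):
    """Fraction of (target, traced) trials in which the model left the target path."""
    if not trials:
        return 0.0
    switched = sum(
        1 for target, traced in trials
        if first_divergence(target, traced) is not None
    )
    return switched / len(trials)


# Toy trials (hypothetical): the second one jumps to a competitor at step 2.
trials = [
    (["S", "a", "b", "E"], ["S", "a", "b", "E"]),
    (["S", "a", "b", "E"], ["S", "a", "x", "y"]),
]
rate = switch_rate(trials)
```

A rate that stays near zero as locally similar competitors are added would be the contradicting outcome described above.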
Original abstract
Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that vision-language models (VLMs) fail at line-tracing tasks by losing the target path and switching to locally similar alternatives. It introduces controlled tracing tasks that add nearby competitors while minimizing crossings, overlaps, and semantic ambiguity. Behavioral interventions and internal analyses are presented as evidence that these switches stem from local competition. The work further reports that model scaling, reasoning, and explicit tracing instructions provide only partial or costly mitigation, and that the same failure mode appears in more complex scenes such as tangled cables and metro maps.
Significance. If the causal attribution to local competition holds, the result identifies a concrete, architecture-level limitation in current VLMs that is not resolved by scale or prompting. The controlled-task design supplies a reproducible probe for visual continuity that could be adopted by others studying multimodal robustness.
major comments (2)
- [Abstract] Abstract (behavioral interventions and internal analyses): the attribution of path-switching failures specifically to local competition is the central claim, yet the supporting analyses are described only at a high level. Without targeted interventions (e.g., attention-score editing on distractors) or explicit controls that separate local competition from low-level factors such as patch-based encoding limits or prompt-induced global context loss, the evidence remains consistent with multiple mechanisms and does not yet isolate the proposed cause.
- [Abstract] Controlled tracing tasks (Abstract): the claim that these tasks isolate line-tracing ability rests on the assumption that nearby competitors are the dominant variable once semantic and topological ambiguities are reduced. No quantitative metrics of local similarity or ablation of other perceptual confounds are referenced, which leaves open the possibility that observed switches reflect broader visual encoding weaknesses rather than competition per se.
minor comments (2)
- The manuscript would be strengthened by reporting sample sizes, number of trials per condition, and any statistical tests used to support the behavioral and internal-analysis results.
- Clarify whether the internal analyses consist of attention visualizations, activation correlations, or other techniques, and indicate where the corresponding figures or methods appear.
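One way to report the uncertainty asked for in the first minor comment is a Wilson score interval on each condition's switch rate. The sketch below is a generic illustration, not an analysis taken from the paper: the function name `wilson_interval` is introduced here, and the counts are hypothetical values used only to exercise the code.

```python
import math


def wilson_interval(switches, trials, z=1.96):
    """Wilson score interval for a switch rate (z=1.96 gives roughly 95% coverage)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = switches / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts for one condition, used only to exercise the function.
low, high = wilson_interval(50, 100)
```

The Wilson interval is chosen here because it behaves sensibly when the observed rate is close to 0 or 1, which can happen in easy or very hard conditions.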
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where our evidence for local competition could be strengthened with additional controls and metrics. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
-
Referee: [Abstract] Abstract (behavioral interventions and internal analyses): the attribution of path-switching failures specifically to local competition is the central claim, yet the supporting analyses are described only at a high level. Without targeted interventions (e.g., attention-score editing on distractors) or explicit controls that separate local competition from low-level factors such as patch-based encoding limits or prompt-induced global context loss, the evidence remains consistent with multiple mechanisms and does not yet isolate the proposed cause.
Authors: We agree that stronger causal isolation would improve the central claim. The original behavioral interventions systematically varied distractor similarity while fixing global scene properties and prompt structure, and the internal analyses compared attention maps and activation similarities between target and competitor paths. These results are consistent with local competition but, as noted, remain partly correlational. In the revised manuscript we have added explicit attention-score editing experiments (zeroing attention to distractor patches in selected layers) that measurably reduce switch rates, together with ablations that vary patch size and prompt context length. The new results appear in Section 4.3 and Appendix B and help separate local competition from the low-level confounds mentioned. revision: yes
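The attention-score editing mentioned in this response can be pictured with a small sketch. It is a simplified illustration under stated assumptions, not the authors' procedure: it operates on a single row of post-softmax attention weights, assumes the distractor patch indices are already known, and renormalises the remaining weights, a step the rebuttal does not specify. The name `zero_distractor_attention` is hypothetical.

```python
def zero_distractor_attention(weights, distractor_idx):
    """Zero the attention on distractor patches and renormalise the rest.

    weights: attention weights for one query over image patches (sums to 1)
    distractor_idx: indices of patches covering the competing line (assumed known)
    """
    blocked = set(distractor_idx)
    edited = [0.0 if i in blocked else w for i, w in enumerate(weights)]
    total = sum(edited)
    if total == 0.0:
        return edited  # every patch was masked; nothing left to renormalise
    return [w / total for w in edited]


# Toy weights (hypothetical): patch 1 lies on the distractor line.
attn = [0.1, 0.4, 0.3, 0.2]
edited_attn = zero_distractor_attention(attn, [1])
```

In a real model the edit would be applied inside selected layers during the forward pass; this sketch shows only the arithmetic of the intervention.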
-
Referee: [Abstract] Controlled tracing tasks (Abstract): the claim that these tasks isolate line-tracing ability rests on the assumption that nearby competitors are the dominant variable once semantic and topological ambiguities are reduced. No quantitative metrics of local similarity or ablation of other perceptual confounds are referenced, which leaves open the possibility that observed switches reflect broader visual encoding weaknesses rather than competition per se.
Authors: We acknowledge that quantitative grounding for the dominance of local competitors strengthens the task design. The original tasks minimized crossings and semantic overlap by construction, but similarity was controlled qualitatively. In the revision we introduce a local similarity metric (mean cosine similarity of vision-encoder patch embeddings within a sliding window around each path segment) and report its correlation with switch rates. We further add ablation conditions that introduce non-competitor confounds (additive noise, contrast reduction) while removing nearby lines; switch rates remain substantially lower than in the competitor-present conditions. These metrics and ablations are now reported in Section 3.2 and Appendix A. revision: yes
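The local similarity metric described in this response could look roughly like the following. This is a minimal sketch in plain Python, not the paper's implementation: the grid layout, the `window` radius, the choice to compare each path patch only with off-path patches, and the function names `cosine` and `local_similarity` are all assumptions introduced here, and toy 2-D vectors stand in for vision-encoder patch embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_similarity(embeddings, path_cells, window=1):
    """Mean cosine similarity between each path patch and the off-path
    patches inside a (2*window+1) x (2*window+1) window centred on it.

    embeddings: dict mapping (row, col) -> embedding vector (hypothetical layout)
    path_cells: list of (row, col) cells the target path passes through
    """
    on_path = set(path_cells)
    sims = []
    for (r, c) in path_cells:
        for dr in range(-window, window + 1):
            for dc in range(-window, window + 1):
                cell = (r + dr, c + dc)
                if cell in on_path or cell not in embeddings:
                    continue
                sims.append(cosine(embeddings[(r, c)], embeddings[cell]))
    return sum(sims) / len(sims) if sims else 0.0


# Toy 2x3 grids: the top row is the target path, the bottom row a competitor.
similar = {(0, c): [1.0, 0.0] for c in range(3)}
similar.update({(1, c): [1.0, 0.0] for c in range(3)})
distinct = {(0, c): [1.0, 0.0] for c in range(3)}
distinct.update({(1, c): [0.0, 1.0] for c in range(3)})
path = [(0, 0), (0, 1), (0, 2)]
score_similar = local_similarity(similar, path)
score_distinct = local_similarity(distinct, path)
```

A score computed this way per stimulus could then be correlated with the observed switch rate, which is the relationship the authors say they report.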
Circularity Check
No circularity in empirical diagnosis of VLM tracing failures
full rationale
The paper is an empirical diagnostic study that designs controlled line-tracing tasks, measures VLM performance on them, applies behavioral interventions, and reports internal analyses. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the abstract or described content. Claims rest on direct task outcomes and interventions that are independently verifiable against external model evaluations rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for the central empirical findings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLMs can be prompted and evaluated on image-based continuation tasks
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Limitations
Our study focuses on a specific visual primitive: preserving the identity of a selected path under nearby local co...
Tracing instructions (excerpt)
- Start from the point labeled “S” and treat it as the current position.
- Select the dot that is visibly connected to the current position by a continuous line; use this line connection as the sole criterion for selection.
- Do not select a dot based on proximity or visual similarity to the path. For example, do not select a dot simply because it is close to the current position, or because it lies on a nearby parallel segment.
- Move to the selected dot, treat it as the new current position, and repeat until the path ends.
Example: Suppose the current position is “S”, and you see a red dot connected to “S” by a continuous line, and a blue dot nearby but on a different path. Even if blue appears closer or lies in the same direction, you select red because it is connected by the li...
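The tracing steps above amount to walking a path by connectivity alone. The sketch below shows that procedure on a symbolic encoding of the image. It is an illustration only: the `connections` dictionary and the function name `trace_path` are hypothetical, and a model working from pixels has no such ground-truth adjacency to consult.

```python
def trace_path(connections, start="S"):
    """Follow a path by line connectivity only, ignoring proximity.

    connections: dict mapping each dot to the dots joined to it by a line
                 (a hypothetical encoding of the rendered image)
    """
    path = [start]
    current = start
    while True:
        # Candidate next dots: joined to the current dot and not yet visited.
        nxt = [d for d in connections.get(current, []) if d not in path]
        if not nxt:
            return path  # no unvisited connected dot, so the path ends here
        current = nxt[0]
        path.append(current)


# "blue" sits near "S" in the picture but on a different line, so it is never selected.
lines = {
    "S": ["red"],
    "red": ["S", "green"],
    "green": ["red"],
    "blue": ["cyan"],
    "cyan": ["blue"],
}
traced = trace_path(lines)
```

The point of the sketch is that the rule never looks at distance or appearance, which is the behavior the instructions try, by the paper's account unsuccessfully, to elicit from the models.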