pith. sign in

arxiv: 2510.22102 · v2 · submitted 2025-10-25 · 💻 cs.CV · cs.AI· cs.CL

Mitigating Coordinate Prediction Bias from Positional Encoding Failures

Pith reviewed 2026-05-18 05:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords multimodal large language modelscoordinate predictionpositional encodinginference-time correctionlocalization accuracyvision-language modelsfinite-state machinescreen element detection
0
0 comments X

The pith

Shuffling visual positional encodings during inference corrects predictable directional biases in coordinate predictions of multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that visual positional encoding failures in multimodal large language models at high resolutions produce consistent directional biases in coordinate outputs rather than random errors. These biases occur because the models default to internal spatial priors when grounding signals weaken. The authors introduce Vision-PE Shuffle Guidance, a training-free method that shuffles the encodings to surface the unconditioned tendencies and then applies a lightweight finite-state machine to adjust digit decoding and steer the predictions. This yields measurable gains in localization accuracy on the ScreenSpot-Pro benchmark across model scales. Readers would care because it supplies a simple, no-retraining fix for a practical weakness in current vision-language systems.

Core claim

The central claim is that visual positional encoding failures do not produce random noise but instead trigger predictable directional biases in coordinate prediction, and that these biases can be rectified at inference time by shuffling the encodings to isolate position-unconditioned tendencies and then using a finite-state machine to correct the digit sequence, resulting in improved localization accuracy without any model retraining.

What carries the argument

Vision-PE Shuffle Guidance (VPSG), a method that shuffles visual positional encodings to isolate position-unconditioned tendencies and steers coordinate digit decoding through a lightweight finite-state machine.

If this is right

  • Rectifies coordinate drift and raises localization accuracy on benchmarks such as ScreenSpot-Pro.
  • Delivers consistent gains across different model scales.
  • Functions entirely at inference time with no retraining or parameter updates.
  • Shows that the biases are directional and predictable, allowing targeted correction instead of general noise reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current models appear to fall back on learned spatial priors when positional signals degrade, pointing to a broader architectural vulnerability at high resolutions.
  • The same shuffling-plus-correction idea could apply to other spatial grounding tasks such as bounding-box regression or scene-text localization.
  • Better native positional encodings might reduce the need for post-hoc fixes by limiting drift before it appears in the output tokens.

Load-bearing premise

That shuffling visual positional encodings isolates position-unconditioned tendencies that can be reliably used by a finite-state machine to steer digit decoding and correct the observed directional biases.

What would settle it

Running VPSG on multiple models and inputs and finding either no accuracy gain or that the coordinate errors remain random rather than directional and correctable.

Figures

Figures reproduced from arXiv: 2510.22102 by Jing Tang, Kai Han, Xingjian Tao, Yihong Luo, Yiwei Wang, Yujun Cai.

Figure 1
Figure 1. Figure 1: Effect of shuffling visual positional encodings: removing spatial conditioning causes the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Causal view of coordinate prediction. Image con [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: summarizes the distance statistics. Across both Qwen2.5-VL-3B and Qwen2.5-VL-7B, the diagonal-normalized average pairwise distance under shuffled positional encodings is consistently small ( ˜d ≈ 0.16), whereas the normal-PE condition exhibits substantially larger dispersion ( ˜d ≈ 0.40–0.44). This substantial gap, far exceeding the baseline dispersion of random uniform points, confirms that when positiona… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative example of VPSG on a Screenspot-Pro case. The base model with normal [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: An image case from dataset ScreenSpot-Pro (ppt [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, precise coordinate prediction remains a significant challenge, particularly as high-resolution inputs cause visual positional encodings (VPEs) to degrade. We demonstrate that these encoding failures do not result in random noise but instead trigger predictable, directional biases, suggesting that models default to internal spatial priors when grounding signals are weak. To counteract this, we introduce Vision-PE Shuffle Guidance (VPSG), a training-free, inference-time correction method. VPSG isolates position-unconditioned tendencies by shuffling VPEs and utilizes this negative evidence to steer digit decoding through a lightweight finite-state machine. Evaluation on the ScreenSpot-Pro benchmark confirms that VPSG effectively rectifies coordinate drift, yielding consistent improvements in localization accuracy across various model scales without any retraining. Our code is available at https://github.com/taoxj2001/VPSG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that visual positional encoding (VPE) failures in Multimodal Large Language Models (MLLMs) produce predictable directional biases in coordinate prediction rather than random noise, as models revert to internal spatial priors under weak grounding signals. It introduces Vision-PE Shuffle Guidance (VPSG), a training-free inference-time method that shuffles VPEs to isolate position-unconditioned tendencies, feeds the resulting negative evidence into a lightweight finite-state machine to steer digit decoding, and thereby corrects coordinate outputs. Evaluation is reported on the ScreenSpot-Pro benchmark, with the claim of consistent localization accuracy gains across model scales and no retraining required.

Significance. If the central claims hold after addressing the noted concerns, the work offers a practical, zero-training correction for a known weakness in high-resolution coordinate grounding tasks common to GUI agents and visual navigation. The training-free design and code release are clear strengths that would aid adoption. The directional-bias observation is potentially useful, but its significance is limited by the current lack of quantitative support and verification that the shuffling step cleanly isolates the intended signal.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim of 'consistent improvements in localization accuracy across various model scales' is asserted without any reported accuracy deltas, error bars, baseline comparisons, number of test samples, or ablation tables; this absence makes it impossible to judge effect size or robustness.
  2. [§3] §3 (Method): the core assumption that shuffling VPEs isolates position-unconditioned tendencies whose digit statistics reflect only internal spatial priors is load-bearing for the directional-correction guarantee, yet no ablation is described that holds tokenization, attention patterns, and cross-modal alignment fixed while randomizing only the positional component; confounding from other disruptions would invalidate the FSM steering.
minor comments (2)
  1. [§3] The finite-state machine transition rules and state definitions are described only at a high level; adding pseudocode or a small diagram would improve reproducibility.
  2. [§1] Notation for 'VPE' and 'VPSG' is introduced without an explicit glossary or first-use expansion in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing where the manuscript lacked sufficient detail and indicating revisions to strengthen the presentation of results and methodological validation.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of 'consistent improvements in localization accuracy across various model scales' is asserted without any reported accuracy deltas, error bars, baseline comparisons, number of test samples, or ablation tables; this absence makes it impossible to judge effect size or robustness.

    Authors: We agree that the original abstract and §4 provided only a high-level claim without quantitative support. In the revised manuscript we have added Table 1 in §4 reporting per-model accuracies on ScreenSpot-Pro (1,200 samples), baseline vs. VPSG results, deltas (e.g., +4.8% average), and standard deviations from three independent runs as error bars. Baseline comparisons and component ablations are now included to allow assessment of effect size and robustness. revision: yes

  2. Referee: [§3] §3 (Method): the core assumption that shuffling VPEs isolates position-unconditioned tendencies whose digit statistics reflect only internal spatial priors is load-bearing for the directional-correction guarantee, yet no ablation is described that holds tokenization, attention patterns, and cross-modal alignment fixed while randomizing only the positional component; confounding from other disruptions would invalidate the FSM steering.

    Authors: This concern is valid and we have strengthened the justification. The original method applies shuffling exclusively to the VPE tensor while preserving visual token content, tokenization, and cross-modal attention weights. To verify isolation we have added an ablation in revised §4.3 that compares VPE shuffling against controlled alternatives (random feature dropout and attention masking) under otherwise identical conditions. Only VPE shuffling produces the expected shift in digit-output statistics toward internal priors, supporting the assumption while acknowledging that further controls could be explored. revision: yes

Circularity Check

0 steps flagged

No significant circularity: VPSG uses external shuffling for negative evidence and a lightweight FSM without reducing claims to fitted inputs or self-referential definitions.

full rationale

The paper's derivation chain introduces VPSG as a training-free inference-time method that shuffles visual positional encodings to obtain negative evidence of position-unconditioned tendencies, then steers digit decoding via a finite-state machine. This process is presented as independent of retraining and relies on observable outputs from the shuffling operation rather than any parameter fitted to the target coordinate predictions. No equations, self-citations, or ansatzes in the abstract or described method reduce the bias correction to a construction equivalent to its inputs; the central claim of rectifying directional biases is supported by external benchmark evaluation on ScreenSpot-Pro and does not invoke uniqueness theorems or prior author work as load-bearing justification. The derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that positional encoding failures produce predictable directional biases rather than random noise, and that these biases can be isolated by shuffling without introducing new artifacts.

axioms (2)
  • domain assumption Visual positional encoding failures in high-resolution inputs produce predictable, directional biases rather than random noise.
    Stated directly in the abstract as the key observation motivating the method.
  • domain assumption Shuffling VPEs isolates position-unconditioned tendencies that serve as usable negative evidence for correction.
    Core premise of the VPSG procedure described in the abstract.

pith-pipeline@v0.9.0 · 5696 in / 1341 out tokens · 26830 ms · 2026-05-18T05:08:24.679611+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

  3. [3]

    Qwen2.5-VL Technical Report

    URLhttps://arxiv.org/abs/2502.13923. Xi Chen, Shiyang Zhou, Muqi Huang, Jiaxu Feng, Yun Xiong, Kun Zhou, Biao Yang, Yuhui Zhang, Huishuai Bao, Sijia Peng, et al. A 2d semantic-aware position encoding for vision transformers. arXiv preprint arXiv:2505.09466, 2025a. Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, and Yuexian Zou. Advancing gen...

  4. [4]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    URLhttps: //lmsys.org/blog/2023-03-30-vicuna/. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

  5. [5]

    Mul- timodal web navigation with instruction-finetuned foundation models

    Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854,

  6. [6]

    Insights into llm long- context failures: when transformers know but don’t tell

    Muhan Gao, TaiMing Lu, Kuai Yu, Adam Byerly, and Daniel Khashabi. Insights into llm long- context failures: when transformers know but don’t tell. InFindings of the Association for Com- putational Linguistics: EMNLP 2024, pp. 7611–7625,

  7. [7]

    V2PE: improving multimodal long-context capability of vision-language models with variable visual position encoding

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improv- ing multimodal long-context capability of vision-language models with variable visual position encoding.arXiv preprint arXiv:2412.09616,

  8. [8]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,

  9. [9]

    Found in the mid- dle: Calibrating positional attention bias improves long context utilization.arXiv preprint arXiv:2406.16008,

    Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T Le, Abhishek Ku- mar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, et al. Found in the mid- dle: Calibrating positional attention bias improves long context utilization.arXiv preprint arXiv:2406.16008,

  10. [10]

    Fourier position embedding: Enhancing attention’s periodic extension for length generalization.arXiv preprint arXiv:2412.17739,

    Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: Enhancing attention’s periodic extension for length generalization.arXiv preprint arXiv:2412.17739,

  11. [11]

    2025 , publisher =

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981,

  12. [12]

    Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465,

  13. [13]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023a

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023a. Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, and Chunyuan Li. Llava-plus: Learning to use tools for creatin...

  14. [14]

    Accessed: 2023-05-05

    URLwww.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05. OpenAI. OpenAI: Introducing ChatGPT,

  15. [15]

    Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025

    Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial awareness in vision-language models.arXiv preprint arXiv:2503.17349,

  16. [16]

    Understanding gui agent localization biases through logit sharpness.arXiv preprint arXiv:2506.15425,

    Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, and Jing Tang. Understanding gui agent localization biases through logit sharpness.arXiv preprint arXiv:2506.15425,

  17. [17]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-...

  19. [19]

    Should we rely on entity mentions for relation extrac- tion? debiasing relation extraction with counterfactual analysis.arXiv preprint arXiv:2205.03784,

    Yiwei Wang, Muhao Chen, Wenxuan Zhou, Yujun Cai, Yuxuan Liang, Dayiheng Liu, Baosong Yang, Juncheng Liu, and Bryan Hooi. Should we rely on entity mentions for relation extrac- tion? debiasing relation extraction with counterfactual analysis.arXiv preprint arXiv:2205.03784,

  20. [20]

    Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, and Yiwei Wang

    URLhttps://arxiv.org/abs/2407.01100. Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, and Yiwei Wang. Dimo-gui: Advancing test-time scaling in gui grounding via modality-aware visual reasoning. arXiv preprint arXiv:2507.00008,

  21. [21]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for gener- alist gui agents.arXiv preprint arXiv:2410.23218,

  22. [22]

    Qwen2.5-Omni Technical Report

    URLhttps://arxiv.org/abs/2503.20215. Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions.arXiv preprint arXiv:2412.16256,

  23. [23]

    The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    12 Preprint Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Li- juan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision).arXiv preprint arXiv:2309.17421, 9(1):1,

  24. [24]

    Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, and Chun Yuan

    URLhttps://arxiv.org/abs/2402.16617. Hao Yu, Tangyu Jiang, Shuning Jia, Shannan Yan, Shunning Liu, Haolong Qian, Guanghao Li, Shuting Dong, and Chun Yuan. Comrope: Scalable and robust rotary position embedding pa- rameterized by trainable commuting angle matrices. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 4508–4517,

  25. [25]

    Beyond llava-hd: Diving into high-resolution large multimodal models,

    Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, Liang Wang, and Rong Jin. Beyond llava-hd: Diving into high-resolution large multimodal models.arXiv preprint arXiv:2406.08487,

  26. [26]

    Switch to polar coordinates on the first quadrant:u=rcosθ, v=rsinθwithθ∈[0, π/2]and Jacobianr dr dθ

    Hence E[R] = Z 1 0 Z 1 0 p u2 +v 2 4(1−u)(1−v)du dv. Switch to polar coordinates on the first quadrant:u=rcosθ, v=rsinθwithθ∈[0, π/2]and Jacobianr dr dθ. The square boundary imposes 0≤r≤r max(θ) = min{1/cosθ,1/sinθ}. Noting √ u2 +v 2 =rand(1−u)(1−v) = (1−rcosθ)(1−rsinθ), we obtain E[R] = 4 Z π/2 0 Z rmax(θ) 0 (1−rcosθ)(1−rsinθ)r 2 dr dθ. Split atθ=π/4, wh...

  27. [27]

    This completes the proof.□ Remark (application to images)

    15 . This completes the proof.□ Remark (application to images). The constantµ □ is the dispersion benchmark for a unit square. For arbitrary image sizes(W, H), either (i) anisotropically rescale coordinates to[0,1] 2 before com- puting distances and compare toµ □, or (ii) form a per-image Monte-Carlo null by sampling i.i.d. uniform points from[0, W]×[0, H...

  28. [28]

    1” and “0

    The ground-truth bounding box center is[659,857]. The base model without guidance predicts [1024,856], while VPSG successfully outputs the correct[659,857]. Table 5 lists the top-10 logits probabilities at each decoding step. At the earliestx-digit steps, the uncorrected model shows a strong bias toward larger numbers (e.g., tokens “1” and “0” dominate), ...