UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?
Pith reviewed 2026-05-07 13:22 UTC · model grok-4.3
The pith
VLMs achieve moderate alignment with human gaze on user interfaces, with better results for longer viewing durations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments across 1,980 UI screenshots in four categories, nine state-of-the-art VLMs, three runs per model, and three viewing durations show moderate alignment between VLM-generated saliency maps and human eye-tracking ground truth, as measured by correlation coefficient (CC), similarity (SIM), and Kullback-Leibler (KL) divergence. Alignment varies significantly across UI types and improves with longer viewing durations, indicating that VLMs approximate exploratory gaze patterns more closely than initial fixations.
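For readers unfamiliar with the three metrics, here is a minimal sketch using their common definitions from the saliency-benchmarking literature. It is not the paper's released evaluation code: the flattened-list representation, the `EPS` value, and the handling of constant maps are assumptions made for illustration.

```python
import math

EPS = 1e-12  # assumed small constant guarding against log(0) and division by zero

def normalize(m):
    """Scale a flattened map so its values sum to 1."""
    total = sum(m)
    return [v / total for v in m]

def cc(pred, gt):
    """Pearson correlation coefficient between two flattened maps (higher is better)."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gt) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gt)
    if var_p == 0.0 or var_g == 0.0:
        return 0.0  # assumption: treat the undefined constant-map case as 0
    return cov / math.sqrt(var_p * var_g)

def sim(pred, gt):
    """Histogram intersection of the two normalized maps (1 means identical)."""
    return sum(min(p, g) for p, g in zip(normalize(pred), normalize(gt)))

def kl(pred, gt):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p, g = normalize(pred), normalize(gt)
    return sum(gi * math.log(EPS + gi / (pi + EPS)) for pi, gi in zip(p, g))

# Toy maps, not data from the study.
gt_map = [0.1, 0.2, 0.3, 0.4]
reversed_map = [0.4, 0.3, 0.2, 0.1]
```

CC and SIM reward agreement, while KL penalizes it, so a "better" model raises the first two and lowers the third.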
What carries the argument
The zero-shot coordinate prediction pipeline in which a VLM outputs explicit gaze point coordinates that are then converted into continuous saliency maps via Gaussian blurring for direct numeric comparison to human eye-tracking data.
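The conversion step can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the grid size, the `sigma` value, and the sum-to-one normalization are all assumptions. Summing one Gaussian per point gives the same result as blurring a map of point impulses with an untruncated Gaussian kernel.

```python
import math

def points_to_saliency(points, width, height, sigma):
    """Sum an isotropic Gaussian at each (x, y) point, then normalize to sum to 1."""
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            dy2 = (y - py) ** 2
            for x in range(width):
                grid[y][x] += math.exp(-((x - px) ** 2 + dy2) / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total == 0.0:
        return grid  # no points were given: return the all-zero map
    return [[v / total for v in row] for row in grid]

# Hypothetical example: two predicted gaze points on a small 32x24 grid.
saliency = points_to_saliency([(8, 6), (24, 18)], width=32, height=24, sigma=2.0)
peak = max(max(row) for row in saliency)
```

The choice of `sigma` controls how far each predicted point spreads, which is why the fixed kernel is singled out as a load-bearing premise.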
If this is right
- Alignment between VLMs and human gaze improves when longer viewing durations are allowed.
- The degree of alignment differs substantially across the four UI categories tested.
- VLMs appear to model sustained exploratory attention better than the first few fixations.
- Public release of all model predictions and evaluation code allows direct reproduction and extension of the comparisons.
Where Pith is reading between the lines
- The findings could support automated tools that flag UI designs likely to draw attention in unintended ways without running new eye-tracking sessions.
- Differences across UI types imply that attention models may need explicit handling of layout conventions such as navigation bars or mobile touch targets.
- If the gap to human performance persists even with longer durations, it points to a need for training data that explicitly includes gaze sequences rather than static images alone.
- The public dataset and predictions open the door to testing whether lightweight fine-tuning on a subset of the UEyes data would narrow the remaining alignment gap.
Load-bearing premise
That the zero-shot coordinate outputs from VLMs, once blurred with a fixed Gaussian kernel, produce saliency maps that can be compared fairly to human eye-tracking data without systematic distortion from the prompting format or the blurring step.
What would settle it
Re-running the full evaluation pipeline on the same images but replacing the fixed Gaussian blur with an alternative density estimate, such as kernel density estimation with an adaptive bandwidth, and observing whether the reported CC, SIM, and KL scores shift by more than the variation already seen across models.
Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset - comprising 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants - we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using CC, SIM, and KL divergence. Our experiments (1,980 images x 9 models x 3 runs x 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations - suggesting VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.
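As a quick check on the scale of the experiment, the factors quoted in the abstract multiply out as shown below. Treating each image, model, run, and duration combination as one prediction is an assumption about how the authors count.

```python
# Factors quoted in the abstract.
images, models, runs, durations = 1980, 9, 3, 3

# Assumed: one prediction per (image, model, run, duration) combination.
per_model = images * runs * durations   # 17,820 predictions per model
total = per_model * models              # 160,380 predictions overall
```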
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents UIGaze, an empirical benchmarking study that evaluates how closely nine state-of-the-art VLMs can approximate human visual attention on user interfaces. Using the UEyes dataset of 1,980 UI screenshots (webpage, desktop, mobile, poster categories) with eye-tracking data from 62 participants, the authors apply a zero-shot coordinate prediction pipeline: VLMs output gaze point coordinates that are converted to saliency maps via Gaussian blurring and compared to ground truth using CC, SIM, and KL divergence. Experiments (1,980 images × 9 models × 3 runs × 3 durations) show moderate alignment that varies significantly by UI type and improves with longer viewing durations, suggesting VLMs capture exploratory rather than initial fixation patterns. All code, predictions, and results are released publicly.
Significance. If the results hold, the work supplies a useful, reproducible benchmark for VLM gaze prediction in HCI settings, demonstrating moderate but category-dependent alignment with human attention and an interesting duration effect. Strengths include the sizable public dataset, evaluation across multiple models and runs, use of established saliency metrics, and full public release of artifacts, which directly supports reproducibility and follow-on research.
major comments (1)
- [Methodology section] Methodology section (zero-shot pipeline): The exact prompting templates for each VLM and the coordinate extraction procedure from model outputs are not fully specified in the text. Because the central claims of moderate alignment rest on this pipeline and its post-processing (Gaussian blurring to saliency maps), the lack of these details in the manuscript itself creates a reproducibility risk even though code is public; this should be added to the main text or an appendix.
minor comments (2)
- [Abstract] Abstract: The three viewing durations are referenced as improving alignment but their concrete values (e.g., 500 ms, 2 s, 5 s) are not stated; adding them would improve immediate clarity.
- [Results section] Results section: A compact summary table reporting mean CC/SIM/KL values per UI category (averaged across models) would help readers quickly compare the reported variation across webpage, desktop, mobile, and poster interfaces.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below.
- Referee: [Methodology section] Methodology section (zero-shot pipeline): The exact prompting templates for each VLM and the coordinate extraction procedure from model outputs are not fully specified in the text. Because the central claims of moderate alignment rest on this pipeline and its post-processing (Gaussian blurring to saliency maps), the lack of these details in the manuscript itself creates a reproducibility risk even though code is public; this should be added to the main text or an appendix.
Authors: We agree that including the exact prompting templates and coordinate extraction procedure in the manuscript itself strengthens reproducibility. Although the full implementation, prompts, and parsing code are already publicly released, we will add a dedicated subsection (or appendix) to the Methodology section that explicitly lists the prompting templates for each of the nine VLMs (including duration-specific variations), provides example model outputs, and details the coordinate extraction logic (e.g., parsing rules for (x, y) values). This addition will directly address the concern while preserving the paper's focus and length.
Revision: yes
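To make the reproducibility concern concrete, a coordinate extraction rule of the kind discussed here might look like the sketch below. The reply format, the regular expression, and the bounds filter are hypothetical; the paper's actual parsing rules are not given in this review.

```python
import re

# Matches "(x, y)" pairs with non-negative integer or decimal values, e.g. "(120, 45.5)".
POINT_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)")

def extract_points(text, width, height):
    """Return (x, y) floats found in text, keeping only those inside the image."""
    points = []
    for xs, ys in POINT_RE.findall(text):
        x, y = float(xs), float(ys)
        if 0 <= x < width and 0 <= y < height:
            points.append((x, y))
    return points

# Hypothetical model reply; the last point falls outside a 1920x1080 image.
reply = "Likely fixations: (120, 45), (300.5, 210) and (9999, 10)."
points = extract_points(reply, width=1920, height=1080)
```

Small decisions such as whether out-of-bounds points are dropped or clamped can change the resulting saliency map, which is why the exact rules matter for the reported scores.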
Circularity Check
No significant circularity; empirical benchmarking study is self-contained
The paper describes a direct empirical comparison of nine VLMs to human gaze data from the external UEyes dataset (1,980 images, 62 participants) via zero-shot coordinate prediction, Gaussian blurring to saliency maps, and standard metrics (CC, SIM, KL). No derivation chain, fitted parameters renamed as predictions, self-referential equations, or load-bearing self-citations exist; results rest on external ground-truth eye-tracking data and public artifacts rather than reducing to the paper's own inputs by construction. The study is a straightforward benchmarking exercise with no mathematical or definitional circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard saliency map metrics (CC, SIM, KL divergence) appropriately quantify alignment between predicted and human gaze distributions.