arxiv: 2604.26352 · v1 · submitted 2026-04-29 · 💻 cs.HC

UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?

Min Song, Yoonseong Lee, Yeonhu Seo

Pith reviewed 2026-05-07 13:22 UTC · model grok-4.3

classification 💻 cs.HC

keywords vision language models · visual attention · user interfaces · eye tracking · saliency maps · zero-shot prediction · gaze prediction · UI evaluation


The pith

VLMs achieve moderate alignment with human gaze on user interfaces, with better results for longer viewing durations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study examines whether vision-language models can predict where humans direct their attention on user interfaces by running nine models on 1,980 real UI screenshots that come with eye-tracking data from 62 participants. Each model receives the image in a zero-shot setup and outputs predicted gaze coordinates, which are turned into saliency maps through Gaussian blurring and scored against the human ground truth using standard similarity metrics. Results show moderate overall agreement that changes markedly depending on whether the interface is a webpage, desktop, mobile, or poster, and that grows stronger at longer viewing durations. This suggests the models are picking up broader exploratory looking behavior rather than the very first fixations people make. The work matters because accurate automatic prediction of visual attention could reduce the need for expensive human eye-tracking studies when testing interface designs.

Core claim

Experiments across 1,980 UI screenshots in four categories, nine state-of-the-art VLMs, three runs per model, and three viewing durations show moderate alignment between VLM-generated saliency maps and human eye-tracking ground truth on metrics of correlation coefficient, similarity, and Kullback-Leibler divergence. Alignment varies significantly across UI types and improves with longer viewing durations, indicating that VLMs more closely approximate exploratory gaze patterns than initial fixations.
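To make the scale of the design concrete, the short sketch below multiplies out the factors named above. It assumes one prediction per image, model, run, and duration combination, which the summary implies but does not state outright; the constant names are illustrative only.

```python
from itertools import product

# Factors as stated in the summary above.
N_IMAGES, N_MODELS, N_RUNS, N_DURATIONS = 1980, 9, 3, 3

# Assumption: exactly one prediction per (image, model, run, duration) cell.
total_predictions = N_IMAGES * N_MODELS * N_RUNS * N_DURATIONS

# Every image is evaluated under each model/run/duration condition.
conditions = list(product(range(N_MODELS), range(N_RUNS), range(N_DURATIONS)))
per_image = len(conditions)

print(total_predictions)  # 160380
print(per_image)          # 81
```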

What carries the argument

The zero-shot coordinate prediction pipeline in which a VLM outputs explicit gaze point coordinates that are then converted into continuous saliency maps via Gaussian blurring for direct numeric comparison to human eye-tracking data.
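The conversion step can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function name, the grid size, and the value of `sigma` are assumptions, and the paper's actual kernel width is not given here. Summing one Gaussian per point gives the same result as blurring a map of point impulses with that kernel, up to truncation and boundary handling.

```python
import math

def points_to_saliency(points, width, height, sigma):
    """Sum one isotropic Gaussian per gaze point, then scale the map to peak at 1."""
    sal = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                sal[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in sal)
    if peak > 0:
        sal = [[v / peak for v in row] for row in sal]
    return sal

# Hypothetical example: two predicted gaze points on a small 32x24 grid.
# sigma=3.0 is an arbitrary illustrative choice, not a value from the paper.
saliency = points_to_saliency([(8, 6), (24, 18)], width=32, height=24, sigma=3.0)
```

The pure-Python loops are slow for full-resolution screenshots; an array library would be the practical choice, but the logic is the same.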

If this is right

Alignment between VLMs and human gaze improves when longer viewing durations are allowed.
The degree of alignment differs substantially across the four UI categories tested.
VLMs appear to model sustained exploratory attention better than the first few fixations.
Public release of all model predictions and evaluation code allows direct reproduction and extension of the comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The findings could support automated tools that flag UI designs likely to draw attention in unintended ways without running new eye-tracking sessions.
Differences across UI types imply that attention models may need explicit handling of layout conventions such as navigation bars or mobile touch targets.
If the gap to human performance persists even with longer durations, it points to a need for training data that explicitly includes gaze sequences rather than static images alone.
The public dataset and predictions open the door to testing whether lightweight fine-tuning on a subset of the UEyes data would narrow the remaining alignment gap.

Load-bearing premise

That the zero-shot coordinate outputs from VLMs, once blurred with a fixed Gaussian kernel, produce saliency maps that can be compared fairly to human eye-tracking data without systematic distortion from the prompting format or the blurring step.

What would settle it

Re-running the full evaluation pipeline on the same images but replacing Gaussian blurring with an alternative density estimation method such as kernel density estimation and observing whether the reported CC, SIM, and KL scores shift by more than the variation already seen across models.
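One way to phrase that decision rule in code is sketched below. It assumes "variation across models" means the range (max minus min) of the baseline scores and "shift" means the mean absolute per-model change; both are interpretive choices, and the function name and the scores are made-up placeholders, not results from the paper.

```python
def shift_exceeds_model_spread(baseline, alternative):
    """Compare the mean per-model score shift against the baseline spread.

    baseline, alternative: dicts mapping model name -> score for one metric.
    """
    spread = max(baseline.values()) - min(baseline.values())
    shifts = [abs(alternative[m] - baseline[m]) for m in baseline]
    mean_shift = sum(shifts) / len(shifts)
    return mean_shift > spread, mean_shift, spread

# Made-up placeholder CC scores for three hypothetical models.
baseline_cc = {"model_a": 0.40, "model_b": 0.50, "model_c": 0.45}
alternative_cc = {"model_a": 0.42, "model_b": 0.49, "model_c": 0.46}

exceeds, mean_shift, spread = shift_exceeds_model_spread(baseline_cc, alternative_cc)
```

With these placeholder numbers the mean shift is smaller than the spread, so the check would not flag the density-estimation step as decisive.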

Figures

Figures reproduced from arXiv: 2604.26352 by Min Song, Yeonhu Seo, Yoonseong Lee.

**Figure 1.** Best (top) and worst (bottom) saliency prediction cases for GPT-5.4 at 7s duration. Each row shows the original UI ...

Original abstract

Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset - comprising 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants - we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using CC, SIM, and KL divergence. Our experiments (1,980 images x 9 models x 3 runs x 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations - suggesting VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UIGaze runs a decent-scale benchmark of VLMs on UI gaze prediction against real eye data and gets moderate alignment that improves with time.


The main point is that current VLMs can produce gaze points on UI screenshots that line up moderately with human eye-tracking, better on some UI types and with longer simulated viewing times. They took the UEyes set of 1,980 images, prompted nine models zero-shot for coordinates, turned those into saliency maps with Gaussian blur, and scored them with CC, SIM, and KL against the ground truth from 62 participants.

The scale (images times models times runs times durations) and the public code plus predictions are the parts that actually move the needle here. It is the first time this exact combination has been tried at this size for VLMs on UIs, and the data show the models are closer to exploratory patterns than to the very first fixations. That reading of the results holds up from the numbers they report.

The work is straightforward empirical benchmarking with no parameter fitting or self-referential math, so the circularity burden is low and the claims rest on direct comparison to external data. The dataset is public and established, the metrics are standard, and the artifacts let others check the numbers.

One soft spot is that the exact prompt wording and the coordinate extraction step from the model outputs are not spelled out in full detail. That leaves some room for implementation differences, and the blur parameters could shift the saliency maps a bit, but the public release makes it possible to test sensitivity directly. It is not a load-bearing flaw for this kind of study.

This paper is useful for people who build or evaluate VLMs for interface tasks or who need baseline numbers on attention modeling. It is not claiming new theory or broad impact, just solid benchmark data. I would send it to peer review because the experiment is large enough, the artifacts are open, and the central comparison is reproducible enough to be worth referee time.

Referee Report

1 major / 2 minor

Summary. The manuscript presents UIGaze, an empirical benchmarking study that evaluates how closely nine state-of-the-art VLMs can approximate human visual attention on user interfaces. Using the UEyes dataset of 1,980 UI screenshots (webpage, desktop, mobile, poster categories) with eye-tracking data from 62 participants, the authors apply a zero-shot coordinate prediction pipeline: VLMs output gaze point coordinates that are converted to saliency maps via Gaussian blurring and compared to ground truth using CC, SIM, and KL divergence. Experiments (1,980 images × 9 models × 3 runs × 3 durations) show moderate alignment that varies significantly by UI type and improves with longer viewing durations, suggesting VLMs capture exploratory rather than initial fixation patterns. All code, predictions, and results are released publicly.

Significance. If the results hold, the work supplies a useful, reproducible benchmark for VLM gaze prediction in HCI settings, demonstrating moderate but category-dependent alignment with human attention and an interesting duration effect. Strengths include the sizable public dataset, evaluation across multiple models and runs, use of established saliency metrics, and full public release of artifacts, which directly supports reproducibility and follow-on research.

major comments (1)

[Methodology section] (zero-shot pipeline): The exact prompting templates for each VLM and the coordinate extraction procedure from model outputs are not fully specified in the text. Because the central claims of moderate alignment rest on this pipeline and its post-processing (Gaussian blurring to saliency maps), the lack of these details in the manuscript itself creates a reproducibility risk even though code is public; this should be added to the main text or an appendix.

minor comments (2)

[Abstract] The three viewing durations are referenced as improving alignment but their concrete values (e.g., 500 ms, 2 s, 5 s) are not stated; adding them would improve immediate clarity.
[Results section] A compact summary table reporting mean CC/SIM/KL values per UI category (averaged across models) would help readers quickly compare the reported variation across webpage, desktop, mobile, and poster interfaces.
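The summary table requested in the second minor comment could be produced with a small aggregation like the one below. This is only a sketch: the record layout (`category`, `cc`, `sim`, `kl` keys) is a hypothetical format, not the schema of the released results, and the numbers are made-up placeholders rather than values from the paper.

```python
from collections import defaultdict

METRICS = ("cc", "sim", "kl")

def mean_by_category(records):
    """Average each metric over all records that share a UI category."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for rec in records:
        counts[rec["category"]] += 1
        for metric in METRICS:
            sums[rec["category"]][metric] += rec[metric]
    return {
        cat: {m: sums[cat][m] / counts[cat] for m in METRICS}
        for cat in counts
    }

# Made-up placeholder rows (one per model and category), not real results.
records = [
    {"category": "webpage", "model": "model_a", "cc": 0.40, "sim": 0.50, "kl": 1.20},
    {"category": "webpage", "model": "model_b", "cc": 0.60, "sim": 0.54, "kl": 1.00},
    {"category": "mobile", "model": "model_a", "cc": 0.30, "sim": 0.44, "kl": 1.50},
]

table = mean_by_category(records)
for category, row in table.items():
    print(category, {m: round(v, 3) for m, v in row.items()})
```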

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below.


Referee: [Methodology section] (zero-shot pipeline): The exact prompting templates for each VLM and the coordinate extraction procedure from model outputs are not fully specified in the text. Because the central claims of moderate alignment rest on this pipeline and its post-processing (Gaussian blurring to saliency maps), the lack of these details in the manuscript itself creates a reproducibility risk even though code is public; this should be added to the main text or an appendix.

Authors: We agree that including the exact prompting templates and coordinate extraction procedure in the manuscript itself strengthens reproducibility. Although the full implementation, prompts, and parsing code are already publicly released, we will add a dedicated subsection (or appendix) to the Methodology section that explicitly lists the prompting templates for each of the nine VLMs (including duration-specific variations), provides example model outputs, and details the coordinate extraction logic (e.g., parsing rules for (x, y) values). This addition will directly address the concern while preserving the paper's focus and length.
revision: yes
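As an illustration of what such parsing rules might look like, the sketch below pulls `(x, y)` pairs out of free-form model text with a regular expression. The pattern, the function name, the example reply, and the choice to drop out-of-bounds points (rather than clamp them) are all assumptions made for this example; the released parsing code may work differently.

```python
import re

# Matches "(x, y)" where x and y are integers or decimals, with optional spaces.
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")

def extract_points(text, width, height):
    """Return all (x, y) pairs found in text that fall inside the image bounds."""
    points = []
    for xs, ys in _PAIR.findall(text):
        x, y = float(xs), float(ys)
        # Assumption: points outside the image are discarded, not clamped.
        if 0 <= x < width and 0 <= y < height:
            points.append((x, y))
    return points

# Hypothetical model reply for a 1280x720 screenshot.
reply = "Likely fixations: (120, 45), (300.5, 210) and (9999, 10)."
points = extract_points(reply, width=1280, height=720)
print(points)  # [(120.0, 45.0), (300.5, 210.0)]
```

Whether malformed or out-of-range outputs are dropped, clamped, or trigger a retry changes the resulting saliency maps, which is why the referee asks for this step to be documented.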

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarking study is self-contained


The paper describes a direct empirical comparison of nine VLMs to human gaze data from the external UEyes dataset (1,980 images, 62 participants) via zero-shot coordinate prediction, Gaussian blurring to saliency maps, and standard metrics (CC, SIM, KL). No derivation chain, fitted parameters renamed as predictions, self-referential equations, or load-bearing self-citations exist; results rest on external ground-truth eye-tracking data and public artifacts rather than reducing to the paper's own inputs by construction. The study is a straightforward benchmarking exercise with no mathematical or definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions about saliency evaluation metrics and the validity of zero-shot prompting for attention approximation, with no free parameters fitted to the target data and no new entities postulated.

axioms (1)

domain assumption: Standard saliency map metrics (CC, SIM, KL divergence) appropriately quantify alignment between predicted and human gaze distributions.
Invoked when converting model outputs to saliency maps and computing scores against ground truth.
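For readers unfamiliar with the three metrics, the sketch below follows their commonly used definitions in the saliency literature: CC as the Pearson correlation between the two maps, SIM as the histogram intersection of the maps normalized to sum to 1, and KL as the divergence of the prediction from the ground-truth distribution. The function names and the `EPS` constant are choices made for this example, and the paper's own evaluation code may differ in normalization details.

```python
import math

EPS = 1e-12  # small constant to avoid division by zero and log(0)

def _flatten(grid):
    return [v for row in grid for v in row]

def _to_distribution(values):
    total = sum(values)
    return [v / total for v in values]

def cc(pred, truth):
    """Pearson correlation coefficient (undefined if either map is constant)."""
    p, q = _flatten(pred), _flatten(truth)
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

def sim(pred, truth):
    """Histogram intersection of the two maps, each normalized to sum to 1."""
    p = _to_distribution(_flatten(pred))
    q = _to_distribution(_flatten(truth))
    return sum(min(a, b) for a, b in zip(p, q))

def kl(pred, truth):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p = _to_distribution(_flatten(pred))
    q = _to_distribution(_flatten(truth))
    return sum(b * math.log(EPS + b / (a + EPS)) for a, b in zip(p, q))

# Tiny illustrative maps: identical maps give CC = 1, SIM = 1, KL close to 0.
example = [[1.0, 2.0], [3.0, 4.0]]
scores_same = (cc(example, example), sim(example, example), kl(example, example))
```

Note the direction of each metric: higher CC and SIM mean closer agreement, while higher KL means the prediction misses more of the human gaze distribution.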



Reference graph

Works this paper leans on

9 extracted references · 2 canonical work pages · 1 internal anchor

[1] Ali Borji and Laurent Itti. 2015. CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research. arXiv preprint arXiv:1505.03581 (2015).

[2] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. arXiv preprint arXiv:2401.10935.

[3] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2018. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. IEEE Transactions on Image Processing 27, 10 (2018).

[4] Yasser Dahou, Ngoc Dung Huynh, Phuc H. Le-Khac, Wamiq Reyaz Para, Ankit Singh, and Sanath Narayan. 2025. Vision-Language Models Can’t See the Obvious. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[5] Camilo Fosco, Vincent Casser, Amish Kumar Bedi, Peter O’Donovan, Aaron Hertzmann, and Zoya Bylinskii. 2020. Predicting Visual Importance Across Graphic Design Types. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology.

[6] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in Context. In IEEE Conference on Computer Vision and Pattern Recognition.

[7] Yue Jiang, Luis A. Leiva, Hamed Rezazadegan Tavakoli, Paul R.B. Houssel, Julia Kylmälä, and Antti Oulasvirta. 2023. UEyes: Understanding Visual Saliency across User Interface Types. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.

[8] Tilke Judd, Frédo Durand, and Antonio Torralba. 2012. A Benchmark of Computational Models of Saliency to Predict Human Fixations. Technical Report. MIT.

[9] Matthias Kümmerer, Matthias Bethge, and Thomas S. A. Wallis. 2022. DeepGaze III: Modeling Free-Viewing Human Scanpaths with Deep Learning. Journal of Vision 22, 5 (2022).