arxiv: 2607.02083 · v1 · pith:UQH3WNZZnew · submitted 2026-07-02 · 💻 cs.CV

DeepGaze3.5-VL: Modeling Scanpaths via Autoregressive Token Prediction

Pith reviewed 2026-07-03 15:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords: scanpath prediction, visual attention, vision-language models, autoregressive modeling, eye tracking, information gain, generative modeling

The pith

Scanpath prediction reduces to autoregressive token prediction by mapping fixations to text tokens inside pretrained vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that human eye-movement sequences can be modeled by converting fixation coordinates into discrete tokens drawn from a text vocabulary and feeding those tokens to an existing vision-language model. This turns a previously specialized modeling problem into ordinary next-token prediction, so that global factors such as viewer identity or task goals enter through ordinary prompting rather than new architectural modules. The resulting model reports 2.18 bits of information gain on MIT1003, a 46 percent gain over the prior DeepGaze III baseline even when both use the same high-capacity vision encoder. The framing also supplies exact per-fixation likelihoods and supports generative interventions on fixation durations that recover known oculomotor regularities from data alone.

Core claim

By mapping continuous fixation coordinates into a fixed text vocabulary and treating scanpath generation as autoregressive token prediction inside a pretrained vision-language model, DeepGaze3.5-VL obtains state-of-the-art performance across datasets while allowing flexible conditioning on viewer identity, task instructions, and per-fixation attributes through standard prompting.

What carries the argument

Autoregressive token prediction on discretized fixation coordinates inside a vision-language model, which supplies exact per-fixation log-likelihoods equivalent to information gain.
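The link between token probabilities and information gain can be illustrated with a minimal sketch. It assumes that each fixation is written as a short run of tokens, that the fixation probability is the product of those token probabilities (chain rule), and that the baseline assigns probabilities over the same discrete locations. The function names, the 4x4 grid, and all numbers are illustrative assumptions, not values from the paper.

```python
import math

def fixation_prob(token_probs):
    """Chain rule: a fixation written as several tokens has probability
    equal to the product of its conditional token probabilities."""
    return math.prod(token_probs)

def information_gain_bits(model_probs, baseline_probs):
    """Mean information gain in bits per fixation.

    model_probs[i]    : probability the model gave the i-th observed fixation
    baseline_probs[i] : probability a baseline gave the same fixation,
                        over the same discrete set of locations
    """
    if len(model_probs) != len(baseline_probs) or not model_probs:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.log2(p) - math.log2(q)
                for p, q in zip(model_probs, baseline_probs))
    return total / len(model_probs)

# Toy data (made up): two fixations, each an x token and a y token on a
# hypothetical 4x4 grid, compared with a uniform baseline over 16 cells.
model = [fixation_prob([0.5, 0.5]), fixation_prob([0.25, 0.5])]
uniform = [1 / 16, 1 / 16]
print(information_gain_bits(model, uniform))  # → 1.5
```

In practice the baseline in the saliency literature is usually a centre-bias model rather than a uniform distribution; the uniform choice here only keeps the arithmetic easy to follow.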

If this is right

  • Prompting with viewer identity directly captures personalized attention biases without new parameters.
  • Natural-language task descriptions such as visual-search instructions can be added at inference time.
  • Fixation durations and other per-fixation attributes can be predicted jointly with location tokens.
  • The generative model supports controlled in-silico interventions on fixation timing that recover known oculomotor phenomena.
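The conditioning options listed above could be expressed as a single text prompt. The sketch below is one hypothetical way to do that: the figure captions indicate that the paper's prompts are JSON, but the key names (`subject`, `task`, `fields`, `fixations`) and the function `build_prompt` are assumptions made for illustration only.

```python
import json

def build_prompt(history, subject_id=None, task=None, with_duration=False):
    """Assemble a JSON prompt for next-fixation prediction (hypothetical format).

    history       : list of fixations, (x, y) or (x, y, t) with t in milliseconds
    subject_id    : optional viewer identifier, used as a global condition
    task          : optional natural-language task, e.g. a search instruction
    with_duration : whether each fixation carries a duration value
    """
    fields = ["x", "y", "t"] if with_duration else ["x", "y"]
    prompt = {}
    if subject_id is not None:
        prompt["subject"] = str(subject_id)
    prompt["task"] = task if task is not None else "free viewing"
    prompt["fields"] = fields
    rows = []
    for fix in history:
        # Every fixation must carry exactly the declared fields.
        if len(fix) != len(fields):
            raise ValueError(f"expected {len(fields)} values per fixation, got {len(fix)}")
        rows.append([int(v) for v in fix])
    prompt["fixations"] = rows
    return json.dumps(prompt)

print(build_prompt([(50, 50, 230)], subject_id=7,
                   task="search for a clock", with_duration=True))
```

The point of the sketch is that adding a viewer, a task, or a per-fixation attribute changes only the text, not the model architecture.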

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenization approach could be applied to other sequential visual-motor behaviors such as pointing or grasping trajectories.
  • Exact per-token likelihoods open the door to uncertainty-aware attention models for interface evaluation.
  • Scaling the underlying vision-language model on larger eye-tracking corpora may further widen the performance gap.

Load-bearing premise

Discretizing continuous fixation locations into a fixed text vocabulary keeps enough spatial and temporal structure for accurate scanpath prediction without task-specific architectural changes.
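To make the premise concrete, here is a minimal sketch of one possible discretization. The paper's actual scheme is not specified in the material summarized here, so the uniform grid, the 100 bins per axis, the "x,y" text format, and every function name are assumptions. The sketch shows the quantity the premise depends on: the round-trip error is bounded by half a bin width.

```python
def quantize(value, extent, n_bins):
    """Map a coordinate in [0, extent] to an integer bin index in [0, n_bins)."""
    if not 0 <= value <= extent:
        raise ValueError("coordinate outside the image")
    return min(int(value / extent * n_bins), n_bins - 1)

def dequantize(index, extent, n_bins):
    """Return the centre of a bin in pixel units."""
    return (index + 0.5) * extent / n_bins

def fixation_to_text(x, y, width, height, n_bins=100):
    """Render a fixation as plain text that a tokenizer can consume."""
    return f"{quantize(x, width, n_bins)},{quantize(y, height, n_bins)}"

def text_to_fixation(text, width, height, n_bins=100):
    """Decode the text form back to the pixel coordinates of the bin centre."""
    ix, iy = (int(s) for s in text.split(","))
    return dequantize(ix, width, n_bins), dequantize(iy, height, n_bins)

# Hypothetical 1024x768 image with 100 bins per axis.
print(fixation_to_text(512.0, 384.0, 1024, 768))  # → 50,50
print(text_to_fixation("50,50", 1024, 768))
```

With these assumed settings a bin is 10.24 by 7.68 pixels, so decoding can be off by up to about 5 pixels horizontally. Whether that loss matters for scanpath prediction is the empirical question the premise raises.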

What would settle it

A controlled experiment in which an otherwise identical model using continuous coordinate regression or a non-VLM backbone yields equal or higher information gain on the same test sets.

Figures

Figures reproduced from arXiv: 2607.02083 by Matthias Bethge, Matthias Kümmerer, Susmit Agrawal.

Figure 1: Multimodal capability predicts scanpath quality. Information Gain of LoRA-tuned models on MIT1003 versus MMMU score across four VLM families (LLaVA-Next, Gemma 3, InternVL 3.5, Qwen3.5) and multiple scales, with leave-one-out correlation envelopes. The strong correlation (ρ = 0.93) suggests that the representations underlying broad visual understanding are the same ones needed to predict human attention…
Figure 2: Schematic Diagram of our Inference Pipeline.
Figure 3: Architecture matters more than raw scale. SmolVLM2 models remain flat regardless of scale; DeepGaze III improves with better vision encoders; other VLM families show steep gains with size but with markedly different efficiency. InternVL 3.5 reaches 2 bits at 4B parameters, while LLaVA-Next requires 13B for 1.6 bits.
Figure 4: Per-fixation surprise over time. Ground-truth surprise (NLL, bits) remains stable; surprise for shuffled sequences begins near ground truth but diverges later: the model is sensitive to sequential dependencies. Random sequences sit above the uniform baseline, as the model concentrates mass on scene-relevant regions, inflating surprise for uniformly sampled fixations.
Figure 5: Evolution of spatial and sequential information across scanpaths. To isolate contributions of spatial and sequential knowledge, we compare three models. The Spatial-only model decays to near-zero predictability over time, confirming late-stage viewing requires ordinal knowledge. The History-invariant model learns population-level ordinal tendencies, yielding a stable saliency advantage. The Full mode…
Figure 7: Counterfactual duration intervention. Single fixation transition with the conditioned source-fixation duration swept from 50 ms to 600 ms; the ground-truth duration is marked in red. All other inputs (image, spatial history, model weights) are identical across panels. The model concentrates mass sharply at 50–100 ms and broadens monotonically beyond 250 ms.
Figure 6: Duration helps predict early fixations. The value of conditioning on duration is asymmetric, yielding the highest gains for short fixations, and some improvement in predictability of very late fixations. … investigate whether VLMs can leverage this temporal signal to improve spatial predictions, we introduce duration conditioning. We adapt our tokenization scheme by augmenting the spatial coordinate for…
Figure 8: Subject IDs modulate spatial distributions. Predicted density over the first post-central fixation for three subject identifiers on an image from MIT1003. Differences are purely due to the subject token. 3.5 Subject Conditioning. A key advantage of our formulation is the ease with which additional conditioning variables can be incorporated. To demonstrate this, we extend our framework to model individu…
Figure 9: Per-fixation IG in Task vs Free-Viewing. Search IG is high initially: viewing patterns are established early in search. The search-conditioned model achieves an aggregate IG of 3.27 bits/fix, substantially exceeding the free-viewing model’s 2.15 bits/fix on the same COCO images. The task effect is almost entirely mediated by target presence. When the search target appears in the image, IG rises to 4.23…
Figure 10: Task changes predicted density without architectural changes.
Figure 11: Few-shot dataset adaptation. Fine-tuning a rank-1 LoRA adapter on a held-out dataset (MIT1003) rapidly closes the gap to the fully in-distribution upper bound, confirming that the tuned model requires only minimal calibration to new viewing conditions. While out-of-distribution performance is inherently strong (as shown in the LODO benchmarks of the main text), models can be further adapted to the viewin…
Figure 12: Subject-conditioned predictions diverge over successive fixations.
Figure 13: Visual Search Performance by Target. Left: Information gain across categories, including zero-shot generalization to unseen targets. Right: Breakdown of IG when the target is present vs. absent in the scene. Unseen target categories (e.g., bottle, chair, knife) achieve comparable IG (around 3.0–3.3 bits) to seen targets, demonstrating the VLM’s ability to leverage open-vocabulary semantic knowledge. As …
Figure 14: Qualitative comparison of generated scanpaths.
Figure 15: Qualitative comparison of distractor suppression.
Figure 16: Free-viewing scanpath prediction. The model predicts (x, y) fixation coordinates only, without duration. JSON keys in red, string values in blue.
Figure 17: A single training instance for scanpath prediction fine-tuning. JSON keys in red, string values in blue.
Figure 18: Free-viewing scanpath prediction with fixation durations. The model predicts (x, y, t) tuples, where t is dwell time in milliseconds.
Figure 19: Target-directed visual search scanpath prediction. The search target (here: clock) is specified in the prompt; the number of fixations varies per trial, reflecting actual search termination.
Abstract

Understanding human visual attention on a scene over time has applications in domains such as interface design and inferring cognitive states. Modeling visual scanpaths has historically relied on specialized architectures with hand-crafted priors. While these architectures can model fixation sequences, their rigid structural biases restrict easy extendability and flexible conditioning. For instance, integrating task-specific instructions or adapting to distinct viewer identities requires custom, disjoint architectural additions. We frame scanpath prediction purely as a discrete sequence modeling task. By mapping coordinates into a text vocabulary, we leverage the pretrained representations of Vision-Language Models. This framing absorbs diverse factors of variation: simple prompting allows for global conditioning, such as providing viewer identities to capture personalized biases, or task-specific objectives like visual search. The framework can also integrate per-fixation attributes, such as individual fixation durations, alongside spatial locations. The autoregressive alignment enables the scalable, exact computation of per-fixation log-likelihoods, directly equivalent to the commonly used Information Gain (IG) metric. Our model, DeepGaze3.5-VL, establishes a new state-of-the-art across multiple datasets, achieving 2.18 bits of IG on MIT1003, a 46% improvement over DeepGaze III. This advantage persists even when baselines use identical high-capacity vision encoders. Beyond predictive performance, our generative framework serves as a powerful computational tool for direct behavioral interventions, allowing for controlled in-silico simulations that would be experimentally difficult or impossible to conduct in vivo. We demonstrate this ability by performing controlled interventions on the durations of pre-saccadic fixations, recovering known oculomotor phenomena purely from data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper frames scanpath modeling as autoregressive discrete sequence prediction by mapping continuous fixation coordinates to a text vocabulary and leveraging pretrained vision-language models. It claims this yields a new SOTA of 2.18 bits IG on MIT1003 (46% improvement over DeepGaze III) that holds with matched high-capacity encoders, that per-fixation autoregressive log-likelihoods are mathematically identical to the standard IG metric, and that simple prompting enables flexible conditioning on viewer identity or task while also supporting generative interventions on fixation durations.

Significance. If the coordinate-to-token mapping preserves spatial/temporal fidelity and the IG equivalence is rigorously verified, the approach would be significant for removing the need for hand-crafted architectural priors in scanpath models and enabling prompt-based extensions. The generative simulation capability is a clear strength not present in prior discriminative models.

major comments (3)
  1. Abstract and §3 (coordinate-to-token mapping): the manuscript provides no equation, pseudocode, or parameter for the discretization function, vocabulary size, or spatial resolution (pixels per token). This is load-bearing for the central claim that the discrete space preserves the structure needed for accurate per-fixation likelihoods and for the assertion that reported IG values are comparable to continuous baselines.
  2. Abstract and §4 (IG equivalence): the statement that autoregressive log-likelihood is 'directly equivalent' to the standard IG metric is asserted without derivation, proof of measure equivalence after discretization, or explicit verification that the token space matches the evaluation metric's support. This directly affects whether the 2.18-bit figure and 46% improvement can be interpreted as a genuine advance.
  3. §5 (results): the SOTA claim and the statement that the advantage 'persists even when baselines use identical high-capacity vision encoders' are presented without reference to specific tables, error bars, cross-validation splits, or the exact encoder-matching ablation protocol, leaving the robustness of the matched-encoder result unassessable.
minor comments (1)
  1. §3: Notation for the token vocabulary and the conditioning prompt format should be introduced with explicit symbols to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and details.

Point-by-point responses
  1. Referee: Abstract and §3 (coordinate-to-token mapping): the manuscript provides no equation, pseudocode, or parameter for the discretization function, vocabulary size, or spatial resolution (pixels per token). This is load-bearing for the central claim that the discrete space preserves the structure needed for accurate per-fixation likelihoods and for the assertion that reported IG values are comparable to continuous baselines.

    Authors: We agree that the discretization details are essential for reproducibility and for validating that the discrete token space preserves the necessary spatial and temporal structure. Although §3 describes the high-level mapping of fixation coordinates to a text vocabulary, we did not provide an explicit equation, pseudocode, vocabulary size, or spatial resolution parameter. In the revised manuscript we will add these elements, including the precise discretization function and resolution (pixels per token), to substantiate the fidelity claims.

    Revision: yes

  2. Referee: Abstract and §4 (IG equivalence): the statement that autoregressive log-likelihood is 'directly equivalent' to the standard IG metric is asserted without derivation, proof of measure equivalence after discretization, or explicit verification that the token space matches the evaluation metric's support. This directly affects whether the 2.18-bit figure and 46% improvement can be interpreted as a genuine advance.

    Authors: The equivalence arises because the autoregressive factorization of the joint token probability yields per-fixation log-likelihoods that match the binned information-gain computation once coordinates are discretized. We acknowledge that the manuscript asserts this without a formal derivation or explicit verification of measure equivalence. In the revision we will insert a mathematical derivation in §4 demonstrating the identity after discretization and confirming that the token support aligns with the evaluation bins used for the reported IG values.

    Revision: yes

  3. Referee: §5 (results): the SOTA claim and the statement that the advantage 'persists even when baselines use identical high-capacity vision encoders' are presented without reference to specific tables, error bars, cross-validation splits, or the exact encoder-matching ablation protocol, leaving the robustness of the matched-encoder result unassessable.

    Authors: The SOTA results and matched-encoder comparisons are supported by the quantitative tables and ablation experiments in §5. We agree that the text should explicitly reference these elements. In the revision we will add direct citations to the relevant tables, include error bars, specify the cross-validation splits, and provide a detailed description of the encoder-matching protocol so that the robustness of the 2.18-bit IG claim and the 46% improvement can be fully assessed.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; log-likelihood equivalence to IG is definitional, not fitted

full rationale

The paper frames scanpath modeling as autoregressive token prediction after coordinate discretization and states that per-fixation log-likelihoods are 'directly equivalent to the commonly used Information Gain (IG) metric.' This equivalence follows from the standard definition of IG in saliency/scanpath literature (model log-likelihood relative to baseline) rather than any parameter being fitted to match the evaluation metric. No self-citations are load-bearing for the central claim, no uniqueness theorems are invoked, no ansatz is smuggled, and no predictions reduce to inputs by construction. The reported SOTA numbers (e.g., 2.18 bits IG) are therefore independent empirical outcomes of the trained model, not forced by the evaluation procedure itself. The derivation chain is self-contained against external benchmarks.
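The definitional step referred to above can be written out as a short worked derivation. This is a sketch under stated assumptions, not the paper's own derivation: the notation (image I, fixations f_1, ..., f_N, each written as K tokens t_{i,1}, ..., t_{i,K}, model parameters θ, baseline p_base) is introduced here for illustration, and the first line assumes that every discretized location corresponds to exactly one token string.

```latex
% Assumed notation (not from the paper): image I, fixations f_1..f_N,
% fixation f_i written as K tokens t_{i,1..K}, baseline p_base.
\begin{align}
\log_2 p_\theta(f_i \mid f_{<i}, I)
  &= \sum_{k=1}^{K} \log_2 p_\theta\!\left(t_{i,k} \mid t_{i,<k},\, f_{<i},\, I\right), \\
\mathrm{IG}
  &= \frac{1}{N} \sum_{i=1}^{N}
     \left[ \log_2 p_\theta(f_i \mid f_{<i}, I) - \log_2 p_{\mathrm{base}}(f_i) \right].
\end{align}
% If both probabilities are defined over the same equal-area bins of area A,
% the corresponding densities are p/A, and the \log_2 A terms cancel in the
% difference, so IG in bits per fixation is unchanged.
```

The cancellation in the final comment holds only when model and baseline share the same bins. A comparison against a baseline evaluated at a different spatial resolution would need the bin-area term handled explicitly, which is the concern the referee's second major comment raises.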

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that coordinate tokenization is information-preserving for scanpath statistics; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Discretizing fixation coordinates into a text vocabulary allows pretrained VLM representations to capture the necessary spatial and sequential structure of human scanpaths.
    This premise enables the entire prompting-based framework without custom architectural additions.

pith-pipeline@v0.9.1-grok · 5829 in / 1310 out tokens · 24633 ms · 2026-07-03T15:47:26.465357+00:00 · methodology
