Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

Miguel P. Eckstein; Sana Shehabi; Shravan Murlidaran; Ziqi Wen

arxiv: 2605.17823 · v1 · pith:EZTU4UCEnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-20 12:20 UTC · model grok-4.3


keywords: foveated vision · eye fixations · scene understanding · visual language model · human-like attention · free-viewing · emergent patterns · computational modeling


The pith

A foveated visual language model trained only to maximize scene understanding develops human-like eye fixation patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether human free-viewing eye fixations emerge from the goal of understanding scenes given the limits of foveated vision. It introduces a computational agent that uses simulated foveation and is trained to comprehend entire scenes. This agent produces fixation sequences that closely resemble those of human observers, including starting at the center and then moving to people, text, and important objects. Agents trained on different goals like searching for items or classifying the scene, or with mismatched peripheral vision, do not match human patterns as well. This matters because it offers a functional explanation for why we look where we look without assuming explicit tasks.

Core claim

When a visual language model with simulated foveation is trained to optimize scene comprehension, it develops fixation patterns that match human free-viewing behavior, such as initial central fixations followed by attention to people, text, objects being gazed at or grasped, and semantically meaningful regions. Models trained instead to search or classify scenes, or given peripheral vision that is better or worse than human vision, match human fixations less accurately. The paper concludes that these human-like patterns can emerge as a byproduct of pursuing scene understanding under the constraints of foveated vision.

What carries the argument

The argument rests on the scene-comprehension-trained foveated visual language model, which simulates human-like degradation of peripheral vision and learns a fixation policy that maximizes overall scene understanding.
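To make "simulated foveation" concrete, here is a minimal sketch of an image whose detail degrades with distance from the fixation point. It is an illustration only, not the paper's foveation method: a box blur stands in for a calibrated model of peripheral vision, and the names `blur_radius`, `foveate`, `pixels_per_step`, and `max_radius` are hypothetical.

```python
import math


def blur_radius(x, y, fix_x, fix_y, pixels_per_step=4.0, max_radius=3):
    """Box-blur radius in pixels; grows with distance from fixation (assumed rule)."""
    eccentricity = math.hypot(x - fix_x, y - fix_y)
    return min(max_radius, int(eccentricity // pixels_per_step))


def foveate(image, fix_x, fix_y, pixels_per_step=4.0, max_radius=3):
    """Blur a grayscale image (list of rows) more strongly away from fixation."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            r = blur_radius(x, y, fix_x, fix_y, pixels_per_step, max_radius)
            # Average over a (2r+1) x (2r+1) window clipped to the image bounds.
            ys = range(max(0, y - r), min(height, y + r + 1))
            xs = range(max(0, x - r), min(width, x + r + 1))
            total = sum(image[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


# Toy input: a 16x16 checkerboard, fixated at its center.
image = [[float((x + y) % 2) for x in range(16)] for y in range(16)]
foveated = foveate(image, fix_x=8, fix_y=8)
```

The fixated pixel is left untouched while the corners are averaged over their neighbors, so fine detail survives only near the fixation. Changing `pixels_per_step` is a crude analogue of giving the agent peripheral vision that is better or worse than human vision.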

If this is right

Fixation patterns in free viewing arise specifically from the scene comprehension objective rather than from search or classification tasks.
The accuracy of the peripheral vision simulation is critical for producing human-like fixations.
Human eye movements may serve to optimize information gathering for scene understanding rather than other perceptual goals.
These emergent patterns can be reproduced in artificial systems without direct supervision on eye movement data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This suggests that attention mechanisms in AI could be improved by incorporating similar optimization objectives and biological constraints.
Similar emergent behaviors might appear in other sensory modalities or tasks when biological limits are modeled.
Testing the model on dynamic scenes or with added eye movement costs could further validate or refine the predictions.

Load-bearing premise

The argument assumes that optimizing scene comprehension is the primary functional goal driving human free-viewing fixations, and that the simulated foveation and model architecture adequately capture the relevant biological constraints.

What would settle it

A direct test would be to compare the fixation predictions of the scene-comprehension model against human data on a large set of novel scenes; if the match is no better than models trained on other objectives, the claim would be weakened.
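One way such a comparison could be scored is sketched below. This page does not state the paper's primary metric, so the choice of Normalized Scanpath Saliency (NSS, a standard fixation-prediction score) is an assumption, and the maps, fixations, and function name `nss` are made-up illustrations.

```python
import statistics


def nss(pred_map, human_fixations):
    """Mean z-scored prediction value at human fixation locations.

    pred_map: list of rows of floats (a model's fixation-density map).
    human_fixations: list of (x, y) pixel coordinates.
    """
    values = [v for row in pred_map for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("prediction map is constant; NSS is undefined")
    z_at_fixations = [(pred_map[y][x] - mean) / std for x, y in human_fixations]
    return statistics.fmean(z_at_fixations)


def empty_map(size=4):
    return [[0.0] * size for _ in range(size)]


# Hypothetical maps: model A predicts where people looked, model B does not.
map_a = empty_map()
map_a[1][1] = 1.0
map_a[2][1] = 0.5
map_b = empty_map()
map_b[3][3] = 1.0

human_fixations = [(1, 1), (1, 2)]  # (x, y) pairs, invented for illustration
score_a = nss(map_a, human_fixations)
score_b = nss(map_b, human_fixations)
```

A higher NSS means human fixations fall on regions the model rates above its own average. Under the test described above, the claim would be weakened if the scene-comprehension model's score on novel scenes were no higher than those of the search or classification variants.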

Original abstract

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Scene comprehension training plus human-like foveation produces fixation patterns that match human free-viewing data better than search, classify, or mismatched foveation controls.


The main takeaway is that a foveated model trained to maximize scene comprehension generates fixation patterns that align with human free-viewing signatures, while the same architecture under search or classification objectives, or with non-human foveation, matches less well. The ablations isolate the effect to that specific objective and foveation level rather than to foveation or modeling in general. The manuscript spells out the architecture, reward formulation for comprehension, and the metrics for scoring model versus human fixations, which lets the emergence result stand on the comparisons instead of on a single run. That level of control is what makes the empirical finding usable.

The softer part is the functional interpretation. The authors present the match as evidence that human-like fixations can arise as a byproduct of optimizing comprehension under foveated constraints, which is a reasonable reading of the data. It does not claim this is the only possible driver or that the model proves biological necessity, so the claim stays within what the experiments show. No internal contradictions appear in the setup or results.

This work sits at the overlap of computational vision and vision science. Readers who build or analyze attention mechanisms in models, or who want controlled tests linking objectives to eye-movement data, will get direct value from the ablation structure. The empirical core is grounded enough to merit a full review rather than a desk rejection. I would send it to peer review.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that a foveated visual language model trained to optimize scene comprehension produces emergent human-like fixation patterns during free viewing, including initial central bias and subsequent fixations on people, text, objects, and semantically meaningful regions. Direct ablations show that this objective yields better matches to human data than training for search or classification, or using peripheral vision parameters that deviate from human foveation.

Significance. If the results hold, the work supplies a functional hypothesis for human free-viewing fixations as a byproduct of maximizing scene understanding under biological foveation constraints. The explicit ablations on training objectives and foveation parameters provide concrete evidence distinguishing the scene-comprehension account from alternatives. This strengthens links between computational models of attention and biological vision and offers a testable framework for designing foveated vision systems in AI. The architecture, reward formulation, and evaluation metrics are specified in sufficient detail to support verification of the emergence finding.

minor comments (2)

Abstract: the summary of comparative results would be strengthened by a single sentence noting the model family, the source of human fixation data, and the primary quantitative metric used for matching.
Figure captions and results section: ensure every ablation condition is explicitly labeled and that statistical comparisons (e.g., p-values or confidence intervals) between the scene-comprehension model and baselines are reported for each key fixation signature.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We appreciate the accurate characterization of our contributions regarding emergent human-like fixations from scene-comprehension training under foveated constraints.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via independent ablations and held-out data


The paper's central claim rests on training a foveated model under a scene-comprehension objective and directly comparing the resulting fixation patterns to human data via ablations against search/classify objectives and non-human foveation parameters. No equation or result is shown to reduce by construction to a fitted parameter or self-citation; the match to human fixations is evaluated on independent measurements rather than the training data itself. The architecture, reward formulation, and metrics are specified with sufficient detail to stand alone, and the functional interpretation is framed as a hypothesis supported by the empirical contrasts rather than an imported uniqueness theorem or ansatz. This yields a self-contained empirical finding without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; model training necessarily involves many implicit hyperparameters and architectural choices whose values are not reported.

pith-pipeline@v0.9.0 · 5674 in / 1042 out tokens · 24595 ms · 2026-05-20T12:20:48.993618+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: fRL-SU model … trained to optimize scene comprehension … semantic accuracy … cosine similarity … REINFORCE

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: foveated RL … four fixations … center bias … people/text/objects
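The passage fragments quoted above mention REINFORCE and a cosine-similarity signal. As a rough illustration of how those two pieces fit together, the sketch below trains a softmax policy over a handful of candidate fixation locations, rewarding each choice by the cosine similarity between a made-up "description embedding" for that location and a reference embedding. It is a single-fixation toy under stated assumptions, not the paper's model: the embeddings, `train_policy`, the running baseline, and all hyperparameters are hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(location_embeddings, reference, steps=3000, lr=0.2, seed=0):
    """REINFORCE on a one-step problem: pick a fixation, receive a reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(location_embeddings)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = cosine_similarity(location_embeddings[action], reference)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        for k in range(len(logits)):
            # Gradient of log pi(action) with respect to logit k.
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
    return softmax(logits)


# Hypothetical embeddings of what the model "would say" after each fixation.
reference = [0.0, 1.0, 0.0]
location_embeddings = [
    [1.0, 0.0, 0.0],  # uninformative region
    [1.0, 0.3, 0.0],  # partly informative region
    [0.0, 1.0, 0.1],  # region most aligned with the reference description
    [0.0, 0.0, 1.0],  # uninformative region
]
final_probs = train_policy(location_embeddings, reference)
```

In this toy the policy concentrates on the location whose embedding best matches the reference, without ever being shown where to look. The quoted passages describe a multi-fixation agent, which would need a sequential policy rather than this single-choice version.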

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 7 internal anchors


[1] [1]

M. P. Eckstein, B. R. Beutter, L. S. Stone, Quantifying the performance limits of human saccadic targeting during visual search.Perception30(11), 1389–1401 (2001)

work page 2001

[2] [2]

Najemnik, W

J. Najemnik, W. S. Geisler, Optimal eye movement strategies in visual search.Nature 434(7031), 387–391 (2005), doi:10.1038/nature03390

work page doi:10.1038/nature03390 2005

[3] [3]

G. L. Malcolm, J. M. Henderson, Combining top-down processes to guide eye movements during real-world scene search.Journal of Vision10(2), 4–4 (2010)

work page 2010

[4] [4]

Hoppe, C

D. Hoppe, C. A. Rothkopf, Multi-step planning of eye movements in visual search.Scientific reports9(1), 144 (2019)

work page 2019

[5] [5]

M. F. Peterson, M. P. Eckstein, Looking just below the eyes is optimal across face recognition tasks.Proceedings of the National Academy of Sciences109(48), E3314–E3323 (2012)

work page 2012

[6] [6]

K. R. Gegenfurtner, The interaction between vision and eye movements.Perception45(12), 1333–1357 (2016)

work page 2016

[7] [7]

A. J. de Brouwer, J. R. Flanagan, M. Spering, Functional use of eye movements for an acting system.Trends in Cognitive Sciences25(3), 252–263 (2021)

work page 2021

[8] [8]

M. M. Hayhoe, R. A. Lerch, Visual Guidance of Natural Behavior (2022), doi:10. 1093/acrefore/9780190236557.013.848,https://oxfordre.com/psychology/view/10. 1093/acrefore/9780190236557.001.0001/acrefore-9780190236557-e-848

work page arXiv 2022

[9] [9]

C. A. Rothkopf, M. M. Hayhoe, Computational elements of natural vision.Journal of Vision 25(12), 4–4 (2025)

work page 2025

[10] [10]

Einh ¨auser, M

W. Einh ¨auser, M. Spain, P. Perona, Objects predict fixations better than early saliency.Journal of vision8(14), 18–18 (2008)

work page 2008

[11] [11]

Koehler, F

K. Koehler, F. Guo, S. Zhang, M. P. Eckstein, What do saliency models predict?Journal of vision14(3), 14–14 (2014). 16

work page 2014

[12] [12]

J. M. Henderson, T. R. Hayes, Meaning-based guidance of attention in scenes as revealed by meaning maps.Nature human behaviour1(10), 743–747 (2017)

work page 2017

[13] [13]

C. E. Peacock, T. R. Hayes, J. M. Henderson, The role of meaning in attentional guidance during free viewing of real-world scenes.Acta Psychologica198, 102889 (2019)

work page 2019

[14] [14]

Murlidaran, M

S. Murlidaran, M. P. Eckstein, Eye movements during free viewing to maximize scene under- standing.Nature Communications17(1) (2025), doi:10.1038/s41467-025-67673-w

work page doi:10.1038/s41467-025-67673-w 2025

[15] [15]

M. Cerf, E. P. Frady, C. Koch, Faces and text attract gaze independent of the task: Experimental data and computer model.Journal of Vision9(12), 10–10 (2009)

work page 2009

[16] [16]

Birmingham, W

E. Birmingham, W. F. Bischof, A. Kingstone, Gaze selection in complex social scenes.Visual cognition16(2-3), 341–355 (2008)

work page 2008

[17] [17]

H.-C. Wang, M. Pomplun, The attraction of visual attention to texts in real-world scenes. Journal of vision12(6), 26–26 (2012)

work page 2012

[18] [18]

De Haas, A

B. De Haas, A. L. Iakovidis, D. S. Schwarzkopf, K. R. Gegenfurtner, Individual differences in visual salience vary along semantic dimensions.Proceedings of the National Academy of Sciences116(24), 11687–11692 (2019)

work page 2019

[19] [19]

Nuthmann, J

A. Nuthmann, J. M. Henderson, Object-based attentional selection in scene viewing.Journal of vision10(8), 20–20 (2010)

work page 2010

[20] [20]

M. Land, N. Mennie, J. Rusted, The roles of vision and eye movements in the control of activities of daily living.Perception28(11), 1311–1328 (1999)

work page 1999

[21] [21]

M. S. Castelhano, M. Wieth, J. M. Henderson, I see what you see: Eye movements in real-world scenes are affected by perceived direction of gaze, inInternational Workshop on Attention in Cognitive Systems(Springer) (2007), pp. 251–262

work page 2007

[22] [22]

K. H. Ruddock, D. S. Wooding, S. K. Mannan, The relationship between the locations of spatial features and those of fixations made during visual examination of briefly presented images. Spatial vision10(3), 165–188 (1996). 17

work page 1996

[23] [23]

B. W. Tatler, The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions.Journal of vision7(14), 4–4 (2007)

work page 2007

[24] [24]

L. C. Loschky, A. M. Larson, T. J. Smith, J. P. Magliano, The scene perception & event comprehension theory (SPECT) applied to visual narratives.Topics in Cognitive Science 12(1), 311–351 (2020)

work page 2020

[25] [25]

M. I. Coco, F. Keller, Scan patterns predict sentence production in the cross-modal processing of visual scenes.Cognitive science36(7), 1204–1223 (2012)

work page 2012

[26] [26]

M. I. Coco, F. Keller, Classification of visual and linguistic tasks using eye-movement features. Journal of Vision14(3), 11–11 (2014)

work page 2014

[27] [27]

L. C. Loschky,et al., The role of event understanding in guiding attentional selection in real-world scenes: The scene perception & event comprehension theory (spect).Attention, Perception, & Psychophysics88(3), 92 (2026)

work page 2026

[28]

S. Lu, et al., Ovis: Structural Embedding Alignment for Multimodal Large Language Model. arXiv:2405.20797 (2024)

[29]

Y. Yao, et al., MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv preprint arXiv:2408.01800 (2024)

[30]

S. Bai, et al., Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025)

[31]

C. Clark, et al., Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding. arXiv preprint arXiv:2601.10611 (2026)

[32]

J. S. Perry, W. S. Geisler, Gaze-contingent real-time simulation of arbitrary visual fields, in Human Vision and Electronic Imaging VII (SPIE), vol. 4662 (2002), pp. 57–69

[33]

M. P. Eckstein, W. Schoonveld, S. Zhang, S. C. Mack, E. Akbas, Optimal and human eye movements to clustered low value cues to increase decision rewards during search. Vision Research 113, 137–154 (2015)

[34]

S. Zhang, M. P. Eckstein, Evolution and optimality of similar neural mechanisms for perception and action during search. PLoS Computational Biology 6(9), e1000930 (2010)

[35]

J. F. Ackermann, M. S. Landy, Choice of saccade endpoint under risk. Journal of Vision 13(3), 27–27 (2013)

[36]

T. H. Weisswange, C. A. Rothkopf, T. Rodemann, J. Triesch, Can reinforcement learning explain the development of causal inference in multisensory integration?, in 2009 IEEE 8th International Conference on Development and Learning (IEEE) (2009), pp. 1–7

[37]

B. Sullivan, L. Johnson, D. Ballard, M. Hayhoe, A modular reinforcement learning model for human visuomotor behaviour in a driving task, in Proc. AISB 2011 Symposium (2011), pp. 33–40

[38]

V. Mnih, et al., Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

[39]

W. Zhou, M. P. Eckstein, A deep Q-learning method for optimizing visual search strategies in backgrounds of dynamic noise, in Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment (SPIE), vol. 12035 (2022), pp. 60–67

[40]

M. Kümmerer, M. Bethge, T. S. Wallis, DeepGaze III: Modeling free-viewing human scanpaths with deep learning. Journal of Vision 22(5), 7–7 (2022)

[41]

G. Cartella, et al., Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction, in Proceedings of the IEEE/CVF International Conference on Computer Vision (2025), pp. 16206–16216

[42]

M. Assens, X. Giro-i Nieto, K. McGuinness, N. E. O’Connor, PathGAN: Visual scanpath prediction with generative adversarial networks, in Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018), pp. 0–0

[43]

M. D. Anderson, E. W. Graf, J. H. Elder, K. A. Ehinger, W. J. Adams, Category systems for real-world scenes. Journal of Vision 21(2), 8–8 (2021)

[44]

J. Harel, C. Koch, P. Perona, Graph-based visual saliency. Advances in Neural Information Processing Systems 19 (2006)

[45]

L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998), doi: 10.1109/34.730558

[46]

M. Linka, H. Karimpur, B. de Haas, Protracted development of gaze behaviour. Nature Human Behaviour 9(9), 1887–1897 (2025)

[47]

E. Akbas, M. P. Eckstein, Object detection through search with a foveated visual system. PLoS Computational Biology 13(10), e1005743 (2017)

[48]

T.-Y. Lin, et al., Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs] (2015), http://arxiv.org/abs/1405.0312

[49]

M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, 2D Human Pose Estimation: New Benchmark and State of the Art Analysis, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)

[50]

L. Mertens, E. Yargholi, H. Op de Beeck, J. Van den Stock, J. Vennekens, FindingEmo: An image dataset for emotion recognition in the wild. Advances in Neural Information Processing Systems 37, 4956–4996 (2024)

[51]

A. D. Clarke, B. W. Tatler, Deriving an appropriate baseline for describing fixation behaviour. Vision Research 102, 41–51 (2014)

[52]

S. Murlidaran, Z. Wen, J. Skaza, W. Wang, M. P. Eckstein, Semantic Saliency from Multi-Modal Large Language Model Scene Understanding Maps (2025)

[53]

P.-H. Tseng, R. Carmi, I. G. Cameron, D. P. Munoz, L. Itti, Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision 9(7), 4–4 (2009)

[54]

G. Rosenberg, S. Stadhard, B. C. Hansen, M. R. Greene, The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding. arXiv preprint arXiv:2603.26589 (2026)

[55]

F. A. Wichmann, R. Geirhos, Are deep neural networks adequate behavioral models of human visual perception? Annual Review of Vision Science 9(1), 501–524 (2023)

[56]

M. Martelli, N. J. Majaj, D. G. Pelli, Are faces processed like words? A diagnostic test for recognition by parts. Journal of Vision 5(1), 6–6 (2005)

[57]

M. Jiang, S. Huang, J. Duan, Q. Zhao, SALICON: Saliency in Context, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

[58]

Z. Ouyang, Image Foveation Python: Python implementation of image foveation, https://github.com/ouyangzhibo/Image_Foveation_Python (2018), GitHub repository, accessed: 2018

[59]

S. Lu, et al., Ovis2.5 Technical Report. arXiv:2508.11737 (2025)

[60]

M. Dehghani, et al., Patch n’ Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems 36, 2252–2274 (2023)

[61]

A. Yang, et al., Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025)

[62]

D. Zhang, J. Li, Z. Zeng, F. Wang, Jasper and Stella: distillation of SOTA embedding models. arXiv preprint arXiv:2412.19048 (2024)

[63]

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3), 229–256 (1992)

[64]

W. Kwon, vLLM: An Efficient Inference Engine for Large Language Models, Ph.D. thesis, UC Berkeley (2025)

[65]

I. Oruc, F. Shafai, S. Murthy, P. Lages, T. Ton, The adult face-diet: A naturalistic observation study. Vision Research 157, 222–229 (2019)

[66]

A. Linardos, M. Kümmerer, O. Press, M. Bethge, DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling, in Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 12919–12928

[67]

M. Kümmerer, Deep Gaze II, https://github.com/matthias-k/DeepGaze (2016)

[68]

M. Kümmerer, Saliency Models Implementation, https://github.com/matthias-k/pysaliency (2015)

[69]

N. Carion, et al., SAM 3: Segment Anything with Concepts. arXiv preprint arXiv:2511.16719 (2025)

[70]

M. Linka, B. de Haas, OSIEshort: A small stimulus set can reliably estimate individual differences in semantic salience. Journal of Vision 20(9), 13–13 (2020)

[71]

J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, Q. Zhao, Predicting human gaze beyond pixels. Journal of Vision 14(1), 28–28 (2014)

Acknowledgments

We would like to thank Professor William Wang of the Computer Science Department for his invaluable support and guidance throughout this project. We also extend our gratitude to our lab members, Srijita, Parsa, a...