Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding
Pith reviewed 2026-05-20 12:20 UTC · model grok-4.3
The pith
A foveated visual language model trained only to maximize scene understanding develops human-like eye fixation patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a visual language model with simulated foveation is trained to optimize scene comprehension, it develops fixation patterns that match human free-viewing behavior, such as initial central fixations followed by attention to people, text, objects being gazed at or grasped, and semantically meaningful regions. Models trained instead to search or classify scenes, or given peripheral vision that is better or worse than human vision, match human fixations less accurately. The paper concludes that these human-like patterns can emerge as a byproduct of pursuing scene understanding under the constraints of foveated vision.
What carries the argument
The scene-comprehension-trained foveated visual language model, which simulates human-like peripheral vision degradation and learns fixation policies to maximize overall scene understanding.
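As a rough illustration of what "simulated foveation" can mean, the sketch below blurs a grayscale image more strongly the farther a pixel lies from the current fixation. It is a minimal sketch and not the paper's pipeline: the function name `foveate`, the linear growth of the blur window, and the parameters `fovea_radius` and `blur_per_pixel` are all assumptions made for illustration.

```python
import math


def foveate(image, fixation, fovea_radius=2.0, blur_per_pixel=0.5):
    """Blur a grayscale image (list of rows) more strongly far from the fixation.

    fovea_radius and blur_per_pixel are illustrative parameters, not values
    taken from the paper.
    """
    height, width = len(image), len(image[0])
    fix_row, fix_col = fixation
    output = []
    for r in range(height):
        row = []
        for c in range(width):
            # Distance (in pixels) from the fixated location.
            eccentricity = math.hypot(r - fix_row, c - fix_col)
            # Box-blur half-width grows linearly once outside the fovea.
            k = int(max(0.0, eccentricity - fovea_radius) * blur_per_pixel)
            r0, r1 = max(0, r - k), min(height, r + k + 1)
            c0, c1 = max(0, c - k), min(width, c + k + 1)
            window = [image[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(window) / len(window))
        output.append(row)
    return output


# Toy usage: a 9x9 checkerboard viewed with the fixation at its center.
checkerboard = [[float((r + c) % 2) for c in range(9)] for r in range(9)]
foveated = foveate(checkerboard, fixation=(4, 4))
```

Raising or lowering `blur_per_pixel` is one simple way to picture the "better or worse than human" peripheral-vision conditions mentioned in the core claim, though the paper's actual manipulation may differ.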
If this is right
- Fixation patterns in free viewing arise specifically from the scene comprehension objective rather than from search or classification tasks.
- The accuracy of the peripheral vision simulation is critical for producing human-like fixations.
- Human eye movements may serve to optimize information gathering for scene understanding rather than other perceptual goals.
- These emergent patterns can be reproduced in artificial systems without direct supervision on eye movement data.
Where Pith is reading between the lines
- This suggests that attention mechanisms in AI could be improved by incorporating similar optimization objectives and biological constraints.
- Similar emergent behaviors might appear in other sensory modalities or tasks when biological limits are modeled.
- Testing the model on dynamic scenes or with added eye movement costs could further validate or refine the predictions.
Load-bearing premise
The assumption that optimizing for scene comprehension is the primary functional goal driving human free-viewing fixations and that the simulated foveation and model architecture adequately capture the relevant biological constraints.
What would settle it
A direct test would be to compare the fixation predictions of the scene-comprehension model against human data on a large set of novel scenes; if the match is no better than models trained on other objectives, the claim would be weakened.
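One way such a comparison could be scored is with Normalized Scanpath Saliency (NSS), a standard fixation-prediction metric: z-score each model's priority map and average its value at the human-fixated pixels. This page does not say which metric the paper uses, so NSS, the function name `nss`, and the toy maps below are assumptions for illustration only.

```python
import statistics


def nss(priority_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored map value at fixated pixels.

    priority_map is a list of rows; fixations is a list of (row, col) pairs.
    """
    values = [v for row in priority_map for v in row]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("priority map is constant; NSS is undefined")
    return statistics.fmean((priority_map[r][c] - mean) / std for r, c in fixations)


# Hypothetical toy data: one human fixation at the center of a 3x3 scene.
human_fixations = [(1, 1)]
map_comprehension = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
map_alternative = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

score_comprehension = nss(map_comprehension, human_fixations)
score_alternative = nss(map_alternative, human_fixations)
# Under the test described above, the claim is weakened if the first score
# is not reliably higher than the second across many novel scenes.
```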
Original abstract
When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a foveated visual language model trained to optimize scene comprehension produces emergent human-like fixation patterns during free viewing, including initial central bias and subsequent fixations on people, text, objects, and semantically meaningful regions. Direct ablations show that this objective yields better matches to human data than training for search or classification, or using peripheral vision parameters that deviate from human foveation.
Significance. If the results hold, the work supplies a functional hypothesis for human free-viewing fixations as a byproduct of maximizing scene understanding under biological foveation constraints. The explicit ablations on training objectives and foveation parameters provide concrete evidence distinguishing the scene-comprehension account from alternatives. This strengthens links between computational models of attention and biological vision and offers a testable framework for designing foveated vision systems in AI. The architecture, reward formulation, and evaluation metrics are specified in sufficient detail to support verification of the emergence finding.
Minor comments (2)
- Abstract: the summary of comparative results would be strengthened by a single sentence noting the model family, the source of human fixation data, and the primary quantitative metric used for matching.
- Figure captions and results section: ensure every ablation condition is explicitly labeled and that statistical comparisons (e.g., p-values or confidence intervals) between the scene-comprehension model and baselines are reported for each key fixation signature.
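The confidence intervals requested in the second comment could be obtained, for example, with a paired bootstrap over scenes. The sketch below is a minimal, hedged illustration: the function name `paired_bootstrap_ci`, the percentile method, and the per-scene score lists are assumptions, and the paper may report its statistics differently.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-scene difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired scene by scene")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Resample scenes with replacement and record the mean difference each time.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-scene agreement scores for two model variants.
comprehension_scores = [1.9, 2.1, 1.7, 2.4, 2.0, 1.8]
baseline_scores = [1.2, 1.5, 1.1, 1.6, 1.4, 1.0]
ci_low, ci_high = paired_bootstrap_ci(comprehension_scores, baseline_scores)
```

An interval that excludes zero would indicate a reliable difference between the two variants on that fixation signature, under the usual bootstrap assumptions.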
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We appreciate the accurate characterization of our contributions regarding emergent human-like fixations from scene-comprehension training under foveated constraints.
Circularity Check
No significant circularity; the derivation is self-contained, resting on independent ablations and held-out data.
full rationale
The paper's central claim rests on training a foveated model under a scene-comprehension objective and directly comparing the resulting fixation patterns to human data via ablations against search/classify objectives and non-human foveation parameters. No equation or result is shown to reduce by construction to a fitted parameter or self-citation; the match to human fixations is evaluated on independent measurements rather than the training data itself. The architecture, reward formulation, and metrics are specified with sufficient detail to stand alone, and the functional interpretation is framed as a hypothesis supported by the empirical contrasts rather than an imported uniqueness theorem or ansatz. This yields a self-contained empirical finding without load-bearing self-referential steps.
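The passage quoted elsewhere on this page mentions a cosine-similarity measure of semantic accuracy and REINFORCE. A minimal sketch of how those two pieces can fit together is shown below, assuming a softmax policy over a handful of candidate fixation locations and a fixed embedding per location. The function names, the toy embeddings, the running-average baseline, and the learning rate are all hypothetical and are not taken from the paper.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline, lr):
    """One REINFORCE update for a softmax policy over candidate fixations.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - probs, scaled here by (reward - baseline).
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        x + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


def train(location_embeddings, target_embedding, steps=200, lr=0.5, seed=0):
    """Toy loop: sample a fixation, score it by cosine similarity, update."""
    rng = random.Random(seed)
    logits = [0.0] * len(location_embeddings)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = cosine_similarity(location_embeddings[action], target_embedding)
        logits = reinforce_step(logits, action, reward, baseline, lr)
        baseline += 0.1 * (reward - baseline)  # running-average baseline
    return softmax(logits)


# Hypothetical embeddings of what each fixation location "reveals".
final_policy = train([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]], [0.0, 1.0])
```

The point of the sketch is only that no human gaze data enters the update: the policy is shaped entirely by how well the sampled fixation serves the comprehension reward.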
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: fRL-SU model … trained to optimize scene comprehension … semantic accuracy … cosine similarity … REINFORCE
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: foveated RL … four fixations … center bias … people/text/objects
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Acknowledgments
We would like to thank Professor William Wang of the Computer Science Department for his invaluable support and guidance throughout this project. We also extend our gratitude to our lab members, Srijita, Parsa, a...