pith. sign in

arxiv: 2605.17823 · v1 · pith:EZTU4UCEnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

Pith reviewed 2026-05-20 12:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords foveated visioneye fixationsscene understandingvisual language modelhuman-like attentionfree-viewingemergent patternscomputational modeling
0
0 comments X

The pith

A foveated visual language model trained only to maximize scene understanding develops human-like eye fixation patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether human free-viewing eye fixations emerge from the goal of understanding scenes given the limits of foveated vision. It introduces a computational agent that uses simulated foveation and is trained to comprehend entire scenes. This agent produces fixation sequences that closely resemble those of human observers, including starting at the center and then moving to people, text, and important objects. Agents trained on different goals like searching for items or classifying the scene, or with mismatched peripheral vision, do not match human patterns as well. This matters because it offers a functional explanation for why we look where we look without assuming explicit tasks.

Core claim

When a visual language model with simulated foveation is trained to optimize scene comprehension, it develops fixation patterns that match human free-viewing behavior, such as initial central fixations followed by attention to people, text, objects being gazed at or grasped, and semantically meaningful regions. Models trained instead to search or classify scenes, or given peripheral vision that is better or worse than human vision, match human fixations less accurately. The paper concludes that these human-like patterns can emerge as a byproduct of pursuing scene understanding under the constraints of foveated vision.

What carries the argument

The scene-comprehension-trained foveated visual language model, which simulates human-like peripheral vision degradation and learns fixation policies to maximize overall scene understanding.

If this is right

  • Fixation patterns in free viewing arise specifically from the scene comprehension objective rather than from search or classification tasks.
  • The accuracy of the peripheral vision simulation is critical for producing human-like fixations.
  • Human eye movements may serve to optimize information gathering for scene understanding rather than other perceptual goals.
  • These emergent patterns can be reproduced in artificial systems without direct supervision on eye movement data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that attention mechanisms in AI could be improved by incorporating similar optimization objectives and biological constraints.
  • Similar emergent behaviors might appear in other sensory modalities or tasks when biological limits are modeled.
  • Testing the model on dynamic scenes or with added eye movement costs could further validate or refine the predictions.

Load-bearing premise

The assumption that optimizing for scene comprehension is the primary functional goal driving human free-viewing fixations and that the simulated foveation and model architecture adequately capture the relevant biological constraints.

What would settle it

A direct test would be to compare the fixation predictions of the scene-comprehension model against human data on a large set of novel scenes; if the match is no better than models trained on other objectives, the claim would be weakened.

read the original abstract

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that a foveated visual language model trained to optimize scene comprehension produces emergent human-like fixation patterns during free viewing, including initial central bias and subsequent fixations on people, text, objects, and semantically meaningful regions. Direct ablations show that this objective yields better matches to human data than training for search or classification, or using peripheral vision parameters that deviate from human foveation.

Significance. If the results hold, the work supplies a functional hypothesis for human free-viewing fixations as a byproduct of maximizing scene understanding under biological foveation constraints. The explicit ablations on training objectives and foveation parameters provide concrete evidence distinguishing the scene-comprehension account from alternatives. This strengthens links between computational models of attention and biological vision and offers a testable framework for designing foveated vision systems in AI. The architecture, reward formulation, and evaluation metrics are specified in sufficient detail to support verification of the emergence finding.

minor comments (2)
  1. Abstract: the summary of comparative results would be strengthened by a single sentence noting the model family, the source of human fixation data, and the primary quantitative metric used for matching.
  2. Figure captions and results section: ensure every ablation condition is explicitly labeled and that statistical comparisons (e.g., p-values or confidence intervals) between the scene-comprehension model and baselines are reported for each key fixation signature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We appreciate the accurate characterization of our contributions regarding emergent human-like fixations from scene-comprehension training under foveated constraints.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via independent ablations and held-out data

full rationale

The paper's central claim rests on training a foveated model under a scene-comprehension objective and directly comparing the resulting fixation patterns to human data via ablations against search/classify objectives and non-human foveation parameters. No equation or result is shown to reduce by construction to a fitted parameter or self-citation; the match to human fixations is evaluated on independent measurements rather than the training data itself. The architecture, reward formulation, and metrics are specified with sufficient detail to stand alone, and the functional interpretation is framed as a hypothesis supported by the empirical contrasts rather than an imported uniqueness theorem or ansatz. This yields a self-contained empirical finding without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; model training necessarily involves many implicit hyperparameters and architectural choices whose values are not reported.

pith-pipeline@v0.9.0 · 5674 in / 1042 out tokens · 24595 ms · 2026-05-20T12:20:48.993618+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 7 internal anchors

  1. [1]

    M. P. Eckstein, B. R. Beutter, L. S. Stone, Quantifying the performance limits of human saccadic targeting during visual search.Perception30(11), 1389–1401 (2001)

  2. [2]

    Najemnik, W

    J. Najemnik, W. S. Geisler, Optimal eye movement strategies in visual search.Nature 434(7031), 387–391 (2005), doi:10.1038/nature03390

  3. [3]

    G. L. Malcolm, J. M. Henderson, Combining top-down processes to guide eye movements during real-world scene search.Journal of Vision10(2), 4–4 (2010)

  4. [4]

    Hoppe, C

    D. Hoppe, C. A. Rothkopf, Multi-step planning of eye movements in visual search.Scientific reports9(1), 144 (2019)

  5. [5]

    M. F. Peterson, M. P. Eckstein, Looking just below the eyes is optimal across face recognition tasks.Proceedings of the National Academy of Sciences109(48), E3314–E3323 (2012)

  6. [6]

    K. R. Gegenfurtner, The interaction between vision and eye movements.Perception45(12), 1333–1357 (2016)

  7. [7]

    A. J. de Brouwer, J. R. Flanagan, M. Spering, Functional use of eye movements for an acting system.Trends in Cognitive Sciences25(3), 252–263 (2021)

  8. [8]

    M. M. Hayhoe, R. A. Lerch, Visual Guidance of Natural Behavior (2022), doi:10. 1093/acrefore/9780190236557.013.848,https://oxfordre.com/psychology/view/10. 1093/acrefore/9780190236557.001.0001/acrefore-9780190236557-e-848

  9. [9]

    C. A. Rothkopf, M. M. Hayhoe, Computational elements of natural vision.Journal of Vision 25(12), 4–4 (2025)

  10. [10]

    Einh ¨auser, M

    W. Einh ¨auser, M. Spain, P. Perona, Objects predict fixations better than early saliency.Journal of vision8(14), 18–18 (2008)

  11. [11]

    Koehler, F

    K. Koehler, F. Guo, S. Zhang, M. P. Eckstein, What do saliency models predict?Journal of vision14(3), 14–14 (2014). 16

  12. [12]

    J. M. Henderson, T. R. Hayes, Meaning-based guidance of attention in scenes as revealed by meaning maps.Nature human behaviour1(10), 743–747 (2017)

  13. [13]

    C. E. Peacock, T. R. Hayes, J. M. Henderson, The role of meaning in attentional guidance during free viewing of real-world scenes.Acta Psychologica198, 102889 (2019)

  14. [14]

    Murlidaran, M

    S. Murlidaran, M. P. Eckstein, Eye movements during free viewing to maximize scene under- standing.Nature Communications17(1) (2025), doi:10.1038/s41467-025-67673-w

  15. [15]

    M. Cerf, E. P. Frady, C. Koch, Faces and text attract gaze independent of the task: Experimental data and computer model.Journal of Vision9(12), 10–10 (2009)

  16. [16]

    Birmingham, W

    E. Birmingham, W. F. Bischof, A. Kingstone, Gaze selection in complex social scenes.Visual cognition16(2-3), 341–355 (2008)

  17. [17]

    H.-C. Wang, M. Pomplun, The attraction of visual attention to texts in real-world scenes. Journal of vision12(6), 26–26 (2012)

  18. [18]

    De Haas, A

    B. De Haas, A. L. Iakovidis, D. S. Schwarzkopf, K. R. Gegenfurtner, Individual differences in visual salience vary along semantic dimensions.Proceedings of the National Academy of Sciences116(24), 11687–11692 (2019)

  19. [19]

    Nuthmann, J

    A. Nuthmann, J. M. Henderson, Object-based attentional selection in scene viewing.Journal of vision10(8), 20–20 (2010)

  20. [20]

    M. Land, N. Mennie, J. Rusted, The roles of vision and eye movements in the control of activities of daily living.Perception28(11), 1311–1328 (1999)

  21. [21]

    M. S. Castelhano, M. Wieth, J. M. Henderson, I see what you see: Eye movements in real-world scenes are affected by perceived direction of gaze, inInternational Workshop on Attention in Cognitive Systems(Springer) (2007), pp. 251–262

  22. [22]

    K. H. Ruddock, D. S. Wooding, S. K. Mannan, The relationship between the locations of spatial features and those of fixations made during visual examination of briefly presented images. Spatial vision10(3), 165–188 (1996). 17

  23. [23]

    B. W. Tatler, The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions.Journal of vision7(14), 4–4 (2007)

  24. [24]

    L. C. Loschky, A. M. Larson, T. J. Smith, J. P. Magliano, The scene perception & event comprehension theory (SPECT) applied to visual narratives.Topics in Cognitive Science 12(1), 311–351 (2020)

  25. [25]

    M. I. Coco, F. Keller, Scan patterns predict sentence production in the cross-modal processing of visual scenes.Cognitive science36(7), 1204–1223 (2012)

  26. [26]

    M. I. Coco, F. Keller, Classification of visual and linguistic tasks using eye-movement features. Journal of Vision14(3), 11–11 (2014)

  27. [27]

    L. C. Loschky,et al., The role of event understanding in guiding attentional selection in real-world scenes: The scene perception & event comprehension theory (spect).Attention, Perception, & Psychophysics88(3), 92 (2026)

  28. [28]

    Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

    S. Lu,et al., Ovis: Structural Embedding Alignment for Multimodal Large Language Model. arXiv:2405.20797(2024)

  29. [29]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Y. Yao,et al., Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800 (2024)

  30. [30]

    Qwen3-VL Technical Report

    S. Bai,et al., Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

  31. [31]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    C. Clark,et al., Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding.arXiv preprint arXiv:2601.10611(2026)

  32. [32]

    J. S. Perry, W. S. Geisler, Gaze-contingent real-time simulation of arbitrary visual fields, in Human vision and electronic imaging VII(SPIE), vol. 4662 (2002), pp. 57–69

  33. [33]

    M. P. Eckstein, W. Schoonveld, S. Zhang, S. C. Mack, E. Akbas, Optimal and human eye movements to clustered low value cues to increase decision rewards during search.Vision Research113, 137–154 (2015). 18

  34. [34]

    Zhang, M

    S. Zhang, M. P. Eckstein, Evolution and optimality of similar neural mechanisms for perception and action during search.PLoS Computational Biology6(9), e1000930 (2010)

  35. [35]

    J. F. Ackermann, M. S. Landy, Choice of saccade endpoint under risk.Journal of vision13(3), 27–27 (2013)

  36. [36]

    T. H. Weisswange, C. A. Rothkopf, T. Rodemann, J. Triesch, Can reinforcement learning explain the development of causal inference in multisensory integration?, in2009 IEEE 8th International Conference on Development and Learning(IEEE) (2009), pp. 1–7

  37. [37]

    Sullivan, L

    B. Sullivan, L. Johnson, D. Ballard, M. Hayhoe, A modular reinforcement learning model for human visuomotor behaviour in a driving task, inProc. AISB 2011 Symposium(2011), pp. 33–40

  38. [38]

    Mnih,et al., Human-level control through deep reinforcement learning.nature518(7540), 529–533 (2015)

    V. Mnih,et al., Human-level control through deep reinforcement learning.nature518(7540), 529–533 (2015)

  39. [39]

    W. Zhou, M. P. Eckstein, A deep Q-learning method for optimizing visual search strategies in backgrounds of dynamic noise, inMedical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment(SPIE), vol. 12035 (2022), pp. 60–67

  40. [40]

    K¨ ummerer, M

    M. K¨ ummerer, M. Bethge, T. S. Wallis, DeepGaze III: Modeling free-viewing human scanpaths with deep learning.Journal of Vision22(5), 7–7 (2022)

  41. [41]

    G. Cartella,et al., Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction, inProceedings of the IEEE/CVF International Conference on Computer Vision (2025), pp. 16206–16216

  42. [42]

    Assens, X

    M. Assens, X. Giro-i Nieto, K. McGuinness, N. E. O’Connor, PathGAN: Visual scanpath prediction with generative adversarial networks, inProceedings of the European Conference on Computer Vision (ECCV) Workshops(2018), pp. 0–0

  43. [43]

    M. D. Anderson, E. W. Graf, J. H. Elder, K. A. Ehinger, W. J. Adams, Category systems for real-world scenes.Journal of vision21(2), 8–8 (2021). 19

  44. [44]

    Harel, C

    J. Harel, C. Koch, P. Perona, Graph-based visual saliency.Advances in Neural Information Processing Systems19(2006)

  45. [45]

    L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence20(11), 1254–1259 (1998), conference Name: IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/34.730558

  46. [46]

    Linka, H

    M. Linka, H. Karimpur, B. de Haas, Protracted development of gaze behaviour.Nature Human Behaviour9(9), 1887–1897 (2025)

  47. [47]

    Akbas, M

    E. Akbas, M. P. Eckstein, Object detection through search with a foveated visual system.PLoS computational biology13(10), e1005743 (2017)

  48. [48]

    Microsoft COCO: Common Objects in Context

    T.-Y. Lin,et al., Microsoft COCO: Common Objects in Context.arXiv:1405.0312 [cs](2015), arXiv: 1405.0312,http://arxiv.org/abs/1405.0312

  49. [49]

    Andriluka, L

    M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, 2D Human Pose Estimation: New Bench- mark and State of the Art Analysis, inIEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2014)

  50. [50]

    Mertens, E

    L. Mertens, E. Yargholi, H. Op de Beeck, J. Van den Stock, J. Vennekens, Findingemo: An image dataset for emotion recognition in the wild.Advances in Neural Information Processing Systems37, 4956–4996 (2024)

  51. [51]

    A. D. Clarke, B. W. Tatler, Deriving an appropriate baseline for describing fixation behaviour. Vision research102, 41–51 (2014)

  52. [52]

    Murlidaran, Z

    S. Murlidaran, Z. Wen, J. Skaza, W. Wang, M. P. Eckstein, Semantic Saliency from Multi-Modal Large Language Model Scene Understanding Maps (2025)

  53. [53]

    Tseng, R

    P.-H. Tseng, R. Carmi, I. G. Cameron, D. P. Munoz, L. Itti, Quantifying center bias of observers in free viewing of dynamic natural scenes.Journal of vision9(7), 4–4 (2009). 20

  54. [54]

    Rosenberg, S

    G. Rosenberg, S. Stadhard, B. C. Hansen, M. R. Greene, The Limits of Learning from Pic- tures and Text: Vision-Language Models and Embodied Scene Understanding.arXiv preprint arXiv:2603.26589(2026)

  55. [55]

    F. A. Wichmann, R. Geirhos, Are deep neural networks adequate behavioral models of human visual perception?Annual review of vision science9(1), 501–524 (2023)

  56. [56]

    Martelli, N

    M. Martelli, N. J. Majaj, D. G. Pelli, Are faces processed like words? A diagnostic test for recognition by parts.Journal of Vision5(1), 6–6 (2005)

  57. [57]

    Jiang, S

    M. Jiang, S. Huang, J. Duan, Q. Zhao, SALICON: Saliency in Context, inThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2015)

  58. [58]

    Ouyang, Image Foveation Python: Python implementation of image foveation,https: //github.com/ouyangzhibo/Image_Foveation_Python(2018), gitHub repository, ac- cessed: 2018

    Z. Ouyang, Image Foveation Python: Python implementation of image foveation,https: //github.com/ouyangzhibo/Image_Foveation_Python(2018), gitHub repository, ac- cessed: 2018

  59. [59]

    Ovis2.5 Technical Report

    S. Lu,et al., Ovis2.5 Technical Report.arXiv:2508.11737(2025)

  60. [60]

    Dehghani,et al., Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution

    M. Dehghani,et al., Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems36, 2252–2274 (2023)

  61. [61]

    Qwen3 Technical Report

    A. Yang,et al., Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025)

  62. [62]

    Jasper and stella: distillation of sota embedding models.arXiv preprint arXiv:2412.19048, 2024

    D. Zhang, J. Li, Z. Zeng, F. Wang, Jasper and stella: distillation of sota embedding models. arXiv preprint arXiv:2412.19048(2024)

  63. [63]

    R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning8(3), 229–256 (1992)

  64. [64]

    Kwon,vLLM: An Efficient Inference Engine for Large Language Models, Ph.D

    W. Kwon,vLLM: An Efficient Inference Engine for Large Language Models, Ph.D. thesis, UC Berkeley (2025)

  65. [65]

    I. Oruc, F. Shafai, S. Murthy, P. Lages, T. Ton, The adult face-diet: A naturalistic observation study.Vision research157, 222–229 (2019). 21

  66. [66]

    Linardos, M

    A. Linardos, M. K¨ ummerer, O. Press, M. Bethge, DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling, inProceedings of the IEEE/CVF International Conference on Computer Vision(2021), pp. 12919–12928

  67. [67]

    K¨ ummerer, Deep Gaze ii,https://github.com/matthias-k/DeepGaze(2016)

    M. K¨ ummerer, Deep Gaze ii,https://github.com/matthias-k/DeepGaze(2016)

  68. [68]

    K¨ ummerer, Saleincy Models Implementation,https://github.com/matthias-k/ pysaliency(2015)

    M. K¨ ummerer, Saleincy Models Implementation,https://github.com/matthias-k/ pysaliency(2015)

  69. [69]

    SAM 3: Segment Anything with Concepts

    N. Carion,et al., Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719 (2025)

  70. [70]

    Linka, B

    M. Linka, B. de Haas, OSIEshort: A small stimulus set can reliably estimate individual differ- ences in semantic salience.Journal of vision20(9), 13–13 (2020)

  71. [71]

    Provide your best description of the smallest man-made object present and clearly visible in the image in a sentence. Do not mention the blur seen in the picture

    J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, Q. Zhao, Predicting human gaze beyond pixels. Journal of Vision14(1), 28–28 (2014). Acknowledgments We would like to thank Professor William Wang of the Computer Science Department for his invaluable support and guidance throughout this project. We also extend our gratitude to our lab members, Srijita, Parsa, a...