Modeling Subjective Urban Perception with Human Gaze
Pith reviewed 2026-05-09 19:09 UTC · model grok-4.3
The pith
Gaze data improves predictions of subjective urban perception from street view images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gaze alone already carries useful predictive signals for subjective urban perception, and integrating gaze with scene representations further improves prediction under both semantic and richer visual representations.
What carries the argument
The Gaze-Guided Urban Perception Framework, which tests gaze-only modeling and gaze fusion with semantic and visual scene representations to predict perception labels.
Load-bearing premise
The eye-tracking recordings accurately capture the perceptual processes that viewers use to form their subjective urban perception judgments.
What would settle it
Running the same prediction experiments on a new, independent dataset of eye-tracked street views where adding gaze information fails to improve accuracy over image-only baselines.
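One way such a replication could be scored is a paired bootstrap over the held-out examples of the new dataset. The sketch below is only illustrative: the function name `paired_bootstrap_gain`, its arguments, and the use of plain accuracy are assumptions made here, not details from the paper.

```python
import random


def paired_bootstrap_gain(y_true, pred_image_only, pred_with_gaze,
                          n_boot=2000, seed=0):
    """Return (accuracy gain of the gaze model, share of resamples with gain <= 0).

    All names are hypothetical; the paper's own metric may differ.
    """
    rng = random.Random(seed)
    n = len(y_true)
    # Per-example difference in correctness: +1 gaze helps, -1 gaze hurts, 0 tie.
    diffs = [
        int(g == t) - int(b == t)
        for t, b, g in zip(y_true, pred_image_only, pred_with_gaze)
    ]
    observed = sum(diffs) / n
    non_positive = 0
    for _ in range(n_boot):
        # Resample examples with replacement, keeping the two models paired.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            non_positive += 1
    return observed, non_positive / n_boot
```

A large share of resamples with a non-positive gain would be the outcome that counts against the core claim. A share near zero would support it on the new data.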
Original abstract
Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Place Pulse-Gaze dataset, which augments street-view images with synchronized eye-tracking recordings and per-participant subjective perception labels (e.g., safety, liveliness). It proposes a Gaze-Guided Urban Perception Framework that evaluates three settings—gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations—and reports that gaze alone supplies useful predictive signals while fusion yields further gains.
Significance. If the quantitative results and controls hold, the work is significant for shifting urban perception modeling from purely image-based approaches to ones that explicitly incorporate human perceptual processes via gaze. The new dataset is a concrete contribution that can support follow-on research on multimodal urban computing and human-aligned scene understanding.
major comments (2)
- [Experiments] Experiments section: no comparison is reported against standard bottom-up saliency models (e.g., Itti-Koch or modern deep saliency predictors) as a control. Without this, it is impossible to determine whether the reported predictive power of gaze-only and fusion models arises from signals specific to subjective perception judgments or from generic image-content correlations that any saliency map would capture.
- [§3 and §4] §3 (Dataset) and §4 (Framework): the description of the eye-tracking protocol and label-collection procedure does not include per-attribute alignment analysis or controls that would verify that fixation patterns are driven by the higher-level attributes being labeled rather than low-level visual features. This directly affects the validity of the central claim that gaze data models the formation of subjective judgments.
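The saliency control raised in the first major comment could start from a direct map-level comparison. The sketch below builds a duration-weighted fixation density map and computes its Pearson correlation with a model saliency map. The function names, the `(x, y, duration)` fixation format, and the Gaussian width `sigma` are assumptions for illustration. In practice the saliency map would come from a model such as Itti-Koch.

```python
import math


def fixation_density(fixations, width, height, sigma=1.0):
    """Sum one Gaussian bump per fixation (x, y, duration) on a width x height grid."""
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy, dur in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += dur * math.exp(-d2 / (2.0 * sigma ** 2))
    return grid


def pearson_cc(map_a, map_b):
    """Pearson correlation between two equally sized 2D maps (0.0 if either is flat)."""
    a = [v for row in map_a for v in row]
    b = [v for row in map_b for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

If human fixation maps correlate almost perfectly with bottom-up saliency maps, the gaze signal adds little beyond image content. A weaker correlation leaves room for the perception-specific signal the paper claims.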
minor comments (2)
- [Abstract] The abstract states positive outcomes across three settings but supplies no numerical metrics, error bars, or baseline comparisons; these should be added for immediate readability.
- [§4] Notation for the fusion modules (semantic vs. visual) is introduced without an explicit equation or diagram showing how gaze features are combined with scene features; a small schematic would improve clarity.
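As an illustration of the kind of explicit formulation the second minor comment asks for, one generic option is concatenation fusion followed by a linear classifier. The symbols below are hypothetical and are not taken from the paper: $g$ is a gaze feature vector, $s$ is a scene feature vector (semantic or visual), and $W$, $b$ are learned parameters.

```latex
\[
  z = [\, g \,;\, s \,], \qquad \hat{y} = \operatorname{softmax}(W z + b)
\]
```

The paper's actual fusion modules may differ, for example by using attention or gating instead of concatenation.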
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the specificity of our gaze-based signals and strengthen the connection between gaze patterns and high-level attributes. We address each major comment below and propose targeted revisions.
- Referee: [Experiments] Experiments section: no comparison is reported against standard bottom-up saliency models (e.g., Itti-Koch or modern deep saliency predictors) as a control. Without this, it is impossible to determine whether the reported predictive power of gaze-only and fusion models arises from signals specific to subjective perception judgments or from generic image-content correlations that any saliency map would capture.
  Authors: We agree that this control is essential to isolate whether gaze data contributes signals tied to subjective judgments beyond generic visual saliency. In the revised manuscript, we will add comparisons using the Itti-Koch model and a modern deep saliency predictor (e.g., DeepGaze). Saliency maps will be extracted from the street-view images and evaluated both in isolation and fused with scene representations, directly benchmarking against our gaze-only and gaze-fusion results to demonstrate the added value of human gaze.
  Revision: yes
- Referee: [§3 and §4] §3 (Dataset) and §4 (Framework): the description of the eye-tracking protocol and label-collection procedure does not include per-attribute alignment analysis or controls that would verify that fixation patterns are driven by the higher-level attributes being labeled rather than low-level visual features. This directly affects the validity of the central claim that gaze data models the formation of subjective judgments.
  Authors: We acknowledge that explicit per-attribute alignment analysis would better validate that gaze reflects high-level subjective attributes rather than low-level features. The original submission did not include such post-hoc analysis. In the revision, we will expand §3 and §4 with new analysis correlating fixation patterns (e.g., duration and spatial distribution) with individual attribute labels across participants, and we will incorporate controls for low-level features by referencing the saliency model comparisons added to the experiments. This will directly support the claim that gaze models subjective judgment formation.
  Revision: yes
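The per-attribute alignment analysis promised in the second response could take roughly the following form. The sketch computes the share of fixation time on one semantic class and rank-correlates it with an attribute rating. The function names, the `(x, y, duration)` fixation format with integer pixel indices, and the class labels are assumptions for illustration, not the authors' protocol.

```python
def dwell_share(fixations, label_grid, target_class):
    """Fraction of total fixation duration landing on pixels of target_class.

    fixations: iterable of (x, y, duration) with integer pixel indices.
    label_grid: 2D list of class names indexed as label_grid[y][x].
    """
    total = sum(dur for _, _, dur in fixations)
    if total == 0:
        return 0.0
    on_target = sum(
        dur for x, y, dur in fixations if label_grid[y][x] == target_class
    )
    return on_target / total


def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation (0.0 if either input is constant)."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5
```

Running `spearman` on per-trial dwell shares and ratings, once per attribute and semantic class, would give the alignment table the referee requests. The same statistic computed on saliency-derived shares would serve as the low-level control.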
Circularity Check
No circularity: the claims rest on a new empirical dataset and experimental comparisons.
The paper collects a new Place Pulse-Gaze dataset pairing street-view images with synchronized eye-tracking and perception labels, then evaluates three modeling settings (gaze-only, semantic fusion, visual fusion) via reported performance metrics. No derivation chain, equations, or fitted parameters are defined in terms of the target predictions; results are presented as direct outcomes of training and testing on the held-out data. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known empirical patterns are merely renamed. The central claims therefore remain externally falsifiable through the released dataset and models rather than reducing to their own inputs by construction.