Modeling Subjective Urban Perception with Human Gaze

Konrad Schindler; Lin Che; Marc Pollefeys; Martin Raubal; Peter Kiefer; Xi Wang

arXiv: 2605.00764 · v1 · submitted 2026-05-01 · cs.CV · cs.AI · cs.HC

Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal, Peter Kiefer

Pith reviewed 2026-05-09 19:09 UTC · model grok-4.3

keywords: urban perception · eye tracking · gaze behavior · street view images · subjective evaluation · multimodal modeling · perception prediction

The pith

Gaze data improves predictions of subjective urban perception from street view images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a dataset that pairs street view images with eye-tracking data and subjective perception labels. It introduces a framework that evaluates gaze in three settings: gaze alone, gaze combined with semantic scene information, and gaze combined with richer visual features. Experiments indicate that gaze carries independent predictive value for perception and that fusing it with either type of scene representation improves model performance. This matters because it points to the value of modeling the human viewing process rather than treating perception as a direct function of image content alone.

Core claim

Gaze alone already carries useful predictive signals for subjective urban perception, and integrating gaze with scene representations further improves prediction under both semantic and richer visual representations.

What carries the argument

The Gaze-Guided Urban Perception Framework, which tests gaze-only modeling and gaze fusion with semantic and visual scene representations to predict perception labels.
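The three settings can be pictured as three ways of assembling an input vector. The sketch below is only an illustration under stated assumptions: the feature values, the `SETTINGS` mapping, fusion by simple concatenation, and the nearest-centroid classifier are all hypothetical stand-ins, since this summary does not describe the paper's actual models or fusion modules.

```python
from statistics import mean

# Hypothetical toy samples. "gaze" could hold fixation statistics, "semantic"
# class-area ratios, and "visual" an image embedding; all values are invented.
SAMPLES = [
    {"gaze": [0.20, 3.0], "semantic": [0.60, 0.10], "visual": [0.9, 0.1, 0.2], "label": "high"},
    {"gaze": [0.25, 3.5], "semantic": [0.55, 0.15], "visual": [0.8, 0.2, 0.1], "label": "high"},
    {"gaze": [0.45, 6.0], "semantic": [0.20, 0.50], "visual": [0.1, 0.9, 0.7], "label": "low"},
    {"gaze": [0.50, 6.5], "semantic": [0.25, 0.45], "visual": [0.2, 0.8, 0.8], "label": "low"},
]

# Assumed mapping from each evaluation setting to the feature groups it uses.
SETTINGS = {
    "gaze_only": ("gaze",),
    "gaze_semantic": ("gaze", "semantic"),
    "gaze_visual": ("gaze", "visual"),
}


def build_features(sample, setting):
    """Concatenate the feature groups that belong to one setting."""
    features = []
    for group in SETTINGS[setting]:
        features.extend(sample[group])
    return features


def fit_centroids(samples, setting):
    """Compute one mean feature vector per perception label."""
    rows_by_label = {}
    for sample in samples:
        rows_by_label.setdefault(sample["label"], []).append(build_features(sample, setting))
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in rows_by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))
```

Running the same classifier under each key of `SETTINGS` gives a like-for-like comparison across settings. In practice the feature groups would need to be scaled before concatenation; that step is omitted here to keep the sketch short.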

Load-bearing premise

The eye-tracking recordings accurately capture the perceptual processes that viewers use to form their subjective urban perception judgments.

What would settle it

Running the same prediction experiments on a new, independent dataset of eye-tracked street views where adding gaze information fails to improve accuracy over image-only baselines.
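One way to make such a comparison concrete is a paired bootstrap over held-out items, which gives an interval for the accuracy gain of a gaze-augmented model over an image-only baseline. The sketch below is an assumption-laden illustration, not the paper's evaluation protocol: the function name, the 95% percentile interval, and the toy correctness vectors are all hypothetical.

```python
import random


def paired_bootstrap_gain(correct_base, correct_gaze, n_resamples=2000, seed=0):
    """Estimate the accuracy gain of the gaze model and a 95% percentile interval.

    Both inputs are lists of 0/1 flags, one per held-out item, in the same
    order, marking whether each model predicted that item correctly.
    """
    if len(correct_base) != len(correct_gaze):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_base)
    gains = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping the pairing intact.
        indices = [rng.randrange(n) for _ in range(n)]
        gains.append(sum(correct_gaze[i] - correct_base[i] for i in indices) / n)
    gains.sort()
    lower = gains[int(0.025 * n_resamples)]
    upper = gains[int(0.975 * n_resamples) - 1]
    observed = (sum(correct_gaze) - sum(correct_base)) / n
    return observed, lower, upper
```

If the interval on a new, independent dataset sits at or below zero, adding gaze did not help there, which is the outcome that would count against the core claim.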

Figures

Figures reproduced from arXiv: 2605.00764 by Konrad Schindler, Lin Che, Marc Pollefeys, Martin Raubal, Peter Kiefer, Xi Wang.

**Figure 1.** Significant gaze-only features under one-way ANOVA across perception levels (Low/Neutral/High). Features with …

**Figure 2.** Significant AOI fixation features under one-way ANOVA across perception levels (Low/Neutral/High). The dashed line …

**Figure 3.** Overview of the proposed Gaze-guided Urban Perception Framework. Raw gaze recordings are first segmented into …

**Figure 4.** Qualitative attribution comparison of the …

**Figure 5.** Distribution of Mean Pairwise Distance (MPD) between participants for the three perception dimensions. Ratings are …
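The captions refer to two statistics that are easy to state in code. The sketch below computes a one-way ANOVA F statistic for one gaze feature across the Low/Neutral/High groups, and a Mean Pairwise Distance between participants' ratings of one image. It rests on assumptions: the captions are truncated, so the use of absolute difference as the pairwise distance is a guess, and the function names and numbers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of value groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups, e.g. Low / Neutral / High
    n = len(all_values)    # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    # Mean square between groups divided by mean square within groups.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def mean_pairwise_distance(ratings):
    """Average absolute difference over all pairs of participant ratings.

    Assumes absolute difference as the distance; the paper may define it
    differently.
    """
    pairs = list(combinations(ratings, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)
```

The sketch stops at the F statistic; turning it into a p-value needs the F distribution, which a statistics library such as SciPy provides through `scipy.stats.f_oneway`.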

Original abstract

Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

The new gaze-augmented dataset for urban perception is the real addition here, but the abstract gives no numbers, so the claimed gains are hard to evaluate.

The main new element is the Place Pulse-Gaze dataset that pairs street-view images with synchronized eye-tracking recordings and the usual subjective labels. They also lay out a three-setting framework that tests gaze by itself, gaze fused with semantic scene features, and gaze fused with richer visual features. The abstract reports that gaze alone already predicts the labels and that adding it to either type of scene representation improves results further. That is a clear step past the image-only models that have dominated this area so far, and collecting the paired data is useful work that others can build on. The framework itself is systematic and easy to follow from the description.

The soft spots are straightforward. The abstract states positive outcomes across the three settings but supplies no metrics, baselines, error bars, or even basic effect sizes, so it is impossible to judge whether the improvements are large enough to matter or whether they survive proper controls. The stress-test point about gaze possibly reflecting low-level saliency rather than the higher-level processes that produce judgments like safety or liveliness is worth taking seriously; without comparisons to non-human saliency maps or per-attribute gaze-label alignment checks, it is unclear whether the gains come from modeling perception or from shared image content. If the full paper has those controls and the numbers, the contribution strengthens; if not, the central claim stays under-supported.

This work is aimed at people doing computational urban studies or multimodal scene understanding who already care about human signals. It is worth sending to a serious referee because the dataset is new and the experimental design is laid out cleanly, even though the current evidence is thin and will need close checking on the quantitative side and on whether gaze is truly capturing the intended perceptual processes.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Place Pulse-Gaze dataset, which augments street-view images with synchronized eye-tracking recordings and per-participant subjective perception labels (e.g., safety, liveliness). It proposes a Gaze-Guided Urban Perception Framework that evaluates three settings—gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations—and reports that gaze alone supplies useful predictive signals while fusion yields further gains.

Significance. If the quantitative results and controls hold, the work is significant for shifting urban perception modeling from purely image-based approaches to ones that explicitly incorporate human perceptual processes via gaze. The new dataset is a concrete contribution that can support follow-on research on multimodal urban computing and human-aligned scene understanding.

major comments (2)

[Experiments] Experiments section: no comparison is reported against standard bottom-up saliency models (e.g., Itti-Koch or modern deep saliency predictors) as a control. Without this, it is impossible to determine whether the reported predictive power of gaze-only and fusion models arises from signals specific to subjective perception judgments or from generic image-content correlations that any saliency map would capture.
[§3 and §4] §3 (Dataset) and §4 (Framework): the description of the eye-tracking protocol and label-collection procedure does not include per-attribute alignment analysis or controls that would verify that fixation patterns are driven by the higher-level attributes being labeled rather than low-level visual features. This directly affects the validity of the central claim that gaze data models the formation of subjective judgments.
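The saliency control asked for in the first major comment can be illustrated with a standard saliency-evaluation measure, the linear correlation coefficient (CC) between a human fixation density map and a model saliency map. The sketch below is hypothetical throughout: the grid resolution, the duration weighting, and the function names are assumptions, and the saliency map would come from whichever model is used as the control.

```python
from math import sqrt


def fixation_density(fixations, width, height, cols, rows):
    """Accumulate duration-weighted fixations on a coarse grid.

    Each fixation is (x, y, duration) in image pixel coordinates; fixations
    outside the image are ignored.
    """
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, duration in fixations:
        if 0 <= x < width and 0 <= y < height:
            col = int(x * cols / width)
            row = int(y * rows / height)
            grid[row][col] += duration
    return grid


def pearson_cc(map_a, map_b):
    """Pearson correlation between two equally sized 2D maps."""
    a = [value for row in map_a for value in row]
    b = [value for row in map_b for value in row]
    if len(a) != len(b):
        raise ValueError("maps must have the same number of cells")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant map carries no spatial signal
    return covariance / sqrt(var_a * var_b)
```

A CC near 1 between human fixation maps and model saliency would suggest that gaze adds little beyond bottom-up saliency, while a low CC alongside a predictive gain from gaze would support the paper's reading.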

minor comments (2)

[Abstract] The abstract states positive outcomes across three settings but supplies no numerical metrics, error bars, or baseline comparisons; these should be added for immediate readability.
[§4] Notation for the fusion modules (semantic vs. visual) is introduced without an explicit equation or diagram showing how gaze features are combined with scene features; a small schematic would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the specificity of our gaze-based signals and strengthen the connection between gaze patterns and high-level attributes. We address each major comment below and propose targeted revisions.

Referee: [Experiments] Experiments section: no comparison is reported against standard bottom-up saliency models (e.g., Itti-Koch or modern deep saliency predictors) as a control. Without this, it is impossible to determine whether the reported predictive power of gaze-only and fusion models arises from signals specific to subjective perception judgments or from generic image-content correlations that any saliency map would capture.

Authors: We agree that this control is essential to isolate whether gaze data contributes signals tied to subjective judgments beyond generic visual saliency. In the revised manuscript, we will add comparisons using the Itti-Koch model and a modern deep saliency predictor (e.g., DeepGaze). Saliency maps will be extracted from the street-view images and evaluated both in isolation and fused with scene representations, directly benchmarking against our gaze-only and gaze-fusion results to demonstrate the added value of human gaze.

Revision: yes

Referee: [§3 and §4] §3 (Dataset) and §4 (Framework): the description of the eye-tracking protocol and label-collection procedure does not include per-attribute alignment analysis or controls that would verify that fixation patterns are driven by the higher-level attributes being labeled rather than low-level visual features. This directly affects the validity of the central claim that gaze data models the formation of subjective judgments.

Authors: We acknowledge that explicit per-attribute alignment analysis would better validate that gaze reflects high-level subjective attributes rather than low-level features. The original submission did not include such post-hoc analysis. In the revision, we will expand §3 and §4 with new analysis correlating fixation patterns (e.g., duration and spatial distribution) with individual attribute labels across participants, and we will incorporate controls for low-level features by referencing the saliency model comparisons added to the experiments. This will directly support the claim that gaze models subjective judgment formation.

Revision: yes
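A per-attribute alignment check of the kind discussed in this exchange could be as simple as a rank correlation between one gaze measure and each attribute label. The sketch below is only a hedged illustration: the Spearman correlation, the `greenery_dwell` measure, and the toy ratings are assumptions and do not come from the paper or the rebuttal.

```python
from math import sqrt


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / sqrt(var_x * var_y)


def per_attribute_alignment(trials, gaze_key, attributes):
    """Correlate one gaze measure with each attribute label across trials."""
    gaze_values = [trial[gaze_key] for trial in trials]
    return {
        attribute: spearman(gaze_values, [trial[attribute] for trial in trials])
        for attribute in attributes
    }
```

Computing the same table for a low-level image statistic in place of the gaze measure would give the control the referee asks for: alignment that survives only for the gaze measure points to attribute-driven viewing.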

Circularity Check

0 steps flagged

No circularity: claims rest on new empirical dataset and experimental comparisons

The paper collects a new Place Pulse-Gaze dataset pairing street-view images with synchronized eye-tracking and perception labels, then evaluates three modeling settings (gaze-only, semantic fusion, visual fusion) via reported performance metrics. No derivation chain, equations, or fitted parameters are defined in terms of the target predictions; results are presented as direct outcomes of training and testing on the held-out data. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known empirical patterns are merely renamed. The central claims therefore remain externally falsifiable through the released dataset and models rather than reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical computer vision and human-computer interaction paper centered on new data collection and model evaluation. No mathematical axioms, free parameters, or invented entities are invoked in the abstract; the claims depend on experimental outcomes from the introduced dataset.


[1] [1]

Andreas Bulling, Jamie A Ward, Hans Gellersen, and Gerhard Tröster. 2010. Eye movement analysis for activity recognition using electrooculography.IEEE Transactions on Pattern Analysis and Machine Intelligence33, 4 (2010), 741–753

work page 2010

[2] [2]

Patrick Cavanagh. 2011. Visual cognition.Vision Research51, 13 (2011), 1538– 1551

work page 2011

[3] [3]

Vania Ceccato, Yuhao Kang, Jonatan Abraham, Per Näsman, Fábio Duarte, Song Gao, Lukas Ljungqvist, Fan Zhang, and Carlo Ratti. 2026. What makes a place safe? Assessing AI-generated safety perception scores using Stockholm’s street view images.The British Journal of Criminology66, 2 (2026), 265–289

work page 2026

[4] [4]

Lin Che, Yizi Chen, Tanhua Jin, Martin Raubal, Konrad Schindler, and Peter Kiefer. 2025. Unsupervised urban land use mapping with street view contrastive clustering and a geographical prior. InProceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems. 28–38

work page 2025

[5] [5]

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794

work page 2016

[6] [6]

Xianyu Chen, Ming Jiang, and Qi Zhao. 2021. Predicting human scanpaths in visual question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10876–10885

work page 2021

[7] [7]

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. 2022. Masked-attention mask transformer for universal image segmen- tation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1290–1299

work page 2022

[8] [8]

Broken Windows

Deborah Cohen, Suzanne Spear, Richard Scribner, Patty Kissinger, Karen Mason, and John Wildgen. 2000. “Broken Windows” and the risk of gonorrhea.American Journal of Public Health90, 2 (2000), 230

work page 2000

[9] [9]

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus En- zweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. Modeling Subjective Urban Perception with Human Gaze The cityscapes dataset for semantic urban scene understanding. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3213–3223

[10]

Freya Crosby and Frouke Hermens. 2019. Does it look safe? An eye tracking study into the visual aspects of fear of crime. Quarterly Journal of Experimental Psychology 72, 3 (2019), 599–615.

[11]

Payam Dadvand, Xavier Bartoll, Xavier Basagaña, Albert Dalmau-Bueno, David Martinez, Albert Ambros, Marta Cirach, Margarita Triguero-Mas, Mireia Gascon, Carme Borrell, et al. 2016. Green spaces and general health: roles of mental health status, social support, and physical activity. Environment International 91 (2016), 161–167.

[12]

Liangyang Dai, Chenglong Zheng, Zekai Dong, Yao Yao, Ruifan Wang, Xiaotong Zhang, Shuliang Ren, Jiaqi Zhang, Xiaoqing Song, and Qingfeng Guan. 2021. Analyzing the correlation between visual space and residents’ psychology in Wuhan, China using street-view images and deep-learning technique. City and Environment Interactions 11 (2021), 100069.

[13]

Ap Dijksterhuis and John A Bargh. 2001. The perception-behavior expressway: Automatic effects of social perception on social behavior. In Advances in Experimental Social Psychology. Vol. 33. Elsevier, 1–40.

[14]

Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, and César A Hidalgo. 2016. Deep learning the city: Quantifying urban perception at a global scale. In Proceedings of the European Conference on Computer Vision. 196–212.

[15]

Andrew T Duchowski. 2017. Eye tracking methodology: Theory and practice. Springer.

[16]

Kaiqun Fu, Zhiqian Chen, and Chang-Tien Lu. 2018. Streetnet: preference learning with convolutional neural network on urban crime perception. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 269–278.

[17]

Paul H Gobster and Lynne M Westphal. 2004. The human dimensions of urban greenways: planning for recreation and related experiences. Landscape and Urban Planning 68, 2-3 (2004), 147–165.

[18]

John M Henderson. 2003. Human gaze control during real-world scene perception. Trends in Cognitive Sciences 7, 11 (2003), 498–504.

[19]

John M. Henderson. 2011. Eye movements and scene perception. In The Oxford Handbook of Eye Movements, Simon P. Liversedge, Iain Gilchrist, and Stefan Everling (Eds.). Oxford University Press, Oxford.

[20]

John M Henderson, Svetlana V Shinkareva, Jing Wang, Steven G Luke, and Jenn Olejarczyk. 2013. Predicting cognitive state from eye movements. PLOS ONE 8, 5 (2013), e64937.

[21]

Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. 2024. Global Streetscapes-A comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics. ISPRS Journal of Photogrammetry and Remote Sensing 215 (2024), 216–238.

[22]

Koichi Ito, Yuhao Kang, Ye Zhang, Fan Zhang, and Filip Biljecki. 2024. Understanding urban perception with visual data: A systematic review. Cities 152 (2024), 105169.

[23]

Yuhao Kang, Junda Chen, Liu Liu, Kshitij Sharma, Martina Mazzarello, Simone Mora, Fábio Duarte, and Carlo Ratti. 2026. Decoding human safety perception with eye-tracking systems, street view images, and explainable AI. Computers, Environment and Urban Systems 123 (2026), 102356.

[24]

Yuhao Kang, Fan Zhang, Song Gao, Hui Lin, and Yu Liu. 2020. A review of urban physical environment sensing using street view imagery in public health studies. Annals of GIS 26, 3 (2020), 261–275.

[25]

George L Kelling and James Q Wilson. 1982. Broken windows. Atlantic Monthly 249, 3 (1982), 29–38.

[26]

Peter Kiefer, Ioannis Giannopoulos, Martin Raubal, and Andrew Duchowski. 2017. Eye tracking for spatial research: Cognition, computation, challenges. Spatial Cognition & Computation 17, 1-2 (2017), 1–19.

[28]

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. 2020. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896 (2020).

[29]

Ian Krajbich, Carrie Armel, and Antonio Rangel. 2010. Visual fixations and the computation and comparison of value in simple choice. Nature Neuroscience 13, 10 (2010), 1292–1298.

[30]

Krzysztof Krejtz, Andrew T Duchowski, Anna Niedzielska, Cezary Biele, and Izabela Krejtz. 2018. Eye tracking cognitive load using pupil diameter and microsaccades with fixed gaze. PLOS ONE 13, 9 (2018), e0203629.

[31]

Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. SAGE Publications.

[32]

Yuki Kubota, Kota Tsubouchi, Soto Anno, Kaito Ide, and Masamichi Shimosaka. 2025. Omni-CityMood: Vision-based urban atmosphere perception from every angle. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems. 186–196.

[34]

Jie Li, Zhonghao Zhang, Fu Jing, Jun Gao, Jianyu Ma, Guofan Shao, and Scott Noel. 2020. An evaluation of urban green space in Shanghai, China, using eye tracking. Urban Forestry & Urban Greening 56 (2020), 126903.

[35]

Yin Li, Miao Liu, and James M Rehg. 2021. In the eye of the beholder: Gaze and actions in first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 6 (2021), 6731–6747.

[36]

Yunqin Li, Nobuyoshi Yabuki, and Tomohiro Fukuda. 2023. Integrating GIS, deep learning, and environmental sensors for multicriteria evaluation of urban street walkability. Landscape and Urban Planning 230 (2023), 104603.

[37]

Dillon Lohr and Oleg V Komogortsev. 2022. Eye know you too: Toward viable end-to-end eye movement biometrics for user authentication. IEEE Transactions on Information Forensics and Security 17 (2022), 3151–3164.

[38]

Kevin Lynch. 1964. The image of the city. MIT Press.

[39]

Bhanuka Mahanama, Yasith Jayawardana, Sundararaman Rengarajan, Gavindya Jayawardena, Leanne Chukoskie, Joseph Snider, and Sampath Jayarathna. 2022. Eye movement and pupil measures: A review. Frontiers in Computer Science 3 (2022), 733531.

[40]

Weiqing Min, Shuhuan Mei, Linhu Liu, Yi Wang, and Shuqiang Jiang. 2019. Multi-task deep relative attribute learning for visual urban perception. IEEE Transactions on Image Processing 29 (2019), 657–669.

[41]

Daniel R. Montello and Martin Raubal. 2013. Functions and applications of spatial cognition. In Handbook of Spatial Cognition, David Waller and Lynn Nadel (Eds.). American Psychological Association, Washington, DC, 249–264.

[42]

Felipe Moreno-Vera, Bahram Lavi, and Jorge Poco. 2021. Quantifying urban safety perception on street view images. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. 611–616.

[43]

Nikhil Naik, Jade Philipoom, Ramesh Raskar, and César Hidalgo. 2014. Streetscore-predicting the perceived safety of one million streetscapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 779–785.

[44]

Jack L Nasar. 1990. The evaluative image of the city. Journal of the American Planning Association 56, 1 (1990), 41–53.

[45]

Jakub Štěpán Novák, Jan Masner, Petr Benda, Pavel Šimek, and Vojtěch Merunka. 2024. Eye tracking, usability, and user experience: A systematic review. International Journal of Human–Computer Interaction 40, 17 (2024), 4484–4500.

[47]

Süleyman Özdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, and Enkelejda Kasneci. 2024. Gaze-guided graph neural network for action anticipation conditioned on intention. In Proceedings of the 2024 Symposium on Eye Tracking Research and Applications. 1–9.

[48]

Ilias O Pappas, Kshitij Sharma, Patrick Mikalef, and Michail N Giannakos. 2020. How quickly can we predict users’ ratings on aesthetic evaluations of websites? Employing machine learning on eye-tracking data. In Conference on e-Business, e-Services and e-Society. 429–440.

[49]

Yunmi Park and Max Garcia. 2020. Pedestrian safety perception and urban street settings. International Journal of Sustainable Transportation 14, 11 (2020), 860–871.

[50]

Lorenzo Porzi, Samuel Rota Bulò, Bruno Lepri, and Elisa Ricci. 2015. Predicting and understanding urban perception with convolutional neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia. 139–148.

[51]

Matias Quintana, Youlong Gu, and Filip Biljecki. 2024. My street is better than your street: Towards data-driven urban planning with visual perception. In Proceedings of the 11th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation. 221–222.

[52]

Matias Quintana, Youlong Gu, Xiucheng Liang, Yujun Hou, Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, and Filip Biljecki. 2025. Global urban visual perception varies across demographics and personalities. Nature Cities (2025), 1–15.

[53]

Keith Rayner. 2009. Eye movements and attention in reading, scene perception, and visual search. The Quarterly Journal of Experimental Psychology 62, 8 (2009), 1457–1506.

[54]

Catherine E Ross and John Mirowsky. 2001. Neighborhood disadvantage, disorder, and health. Journal of Health and Social Behavior 42, 3 (2001), 258–276.

[55]

Philip Salesses, Katja Schechtner, and César A Hidalgo. 2013. The collaborative image of the city: mapping the inequality of urban perception. PLOS ONE 8, 7 (2013), e68400.

[56]

Dario D Salvucci and Joseph H Goldberg. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications. 71–78.

[57]

Abdulrahman Mohamed Selim, Michael Barz, Omair Shahzad Bhatti, Hasan Md Tusfiqur Alam, and Daniel Sonntag. 2024. A review of machine learning in scanpath analysis for passive gaze-based interaction. Frontiers in Artificial Intelligence 7 (2024), 1391745.

[58]

Shinsuke Shimojo, Claudiu Simion, Eiko Shimojo, and Christian Scheier. 2003. Gaze bias both reflects and influences preference. Nature Neuroscience 6, 12 (2003), 1317–1322.

[59]

Harshinee Sriram, Cristina Conati, and Thalia Field. 2023. Classification of Alzheimer’s disease with deep learning on eye-tracking data. In Proceedings of the 25th International Conference on Multimodal Interaction. 104–113.

[60]

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning. 3319–3328.

[61]

Arash Tavakoli, Isabella P Douglas, Hae Young Noh, Jackelyn Hwang, and Sarah L Billington. 2025. Psycho-behavioral responses to urban scenes: An exploration through eye-tracking. Cities 156 (2025), 105568.

[62]

Tobii. 2025. Tobii Pro Spectrum. https://www.tobii.com/products/eye-trackers/screen-based/tobii-pro-spectrum. Accessed 2026-03-21.

[63]

Deltcho Valtchanov and Colin G Ellard. 2015. Cognitive and affective responses to natural scenes: Effects of low level visual properties on preference, cognitive load and eye-movements. Journal of Environmental Psychology 43 (2015), 184–195.

[64]

Lei Wang, Xin Han, Jie He, and Taeyeol Jung. 2022. Measuring residents’ perceptions of city streets to inform better street planning through deep learning and space syntax. ISPRS Journal of Photogrammetry and Remote Sensing 190 (2022), 215–230.

[65]

Ruili Wang, Fan Yang, and Qingqin Wang. 2025. Emotion-based design research of rural street spaces using eye-tracking technology: A case study of Huixingtou Village in Handan City. PLOS ONE 20, 6 (2025), e0326049.

[66]

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. 2021. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1905–1914.

[67]

Zeyu Wang, Koichi Ito, and Filip Biljecki. 2024. Assessing the equity and evolution of urban visual perceptual quality with time series street view imagery. Cities 145 (2024), 104704.

[68]

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677 (2020).

[69]

Nai Yang, Zhitao Deng, Fangtai Hu, Yi Chao, Lin Wan, Qingfeng Guan, and Zhiwei Wei. 2024. Urban perception by using eye movement data on street view images. Transactions in GIS 28, 5 (2024), 1021–1042.

[70]

Yao Yao, Zhaotang Liang, Zehao Yuan, Penghua Liu, Yongpan Bie, Jinbao Zhang, Ruoyu Wang, Jiale Wang, and Qingfeng Guan. 2019. A human-machine adversarial scoring framework for urban perception assessment using street-view images. International Journal of Geographical Information Science 33, 12 (2019), 2363–2384.

[71]

A. L. Yarbus. 1967. Eye Movements and Vision. Springer.

Appendix

A Dataset and Analysis

A.1 Inter-rater Variability Distribution

Figure 5 provides the full distributions of mean pairwise distance for the three perception dimensions, complementing the discussion in Sec. 3.3.

0.00 0.25 0.50 0....