
arXiv: 2605.00764 · v1 · submitted 2026-05-01 · 💻 cs.CV · cs.AI · cs.HC

Modeling Subjective Urban Perception with Human Gaze

Pith reviewed 2026-05-09 19:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC
keywords urban perception · eye tracking · gaze behavior · street view images · subjective evaluation · multimodal modeling · perception prediction

The pith

Gaze data improves predictions of subjective urban perception from street view images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a dataset that pairs street view images with eye-tracking recordings and individual subjective perception labels. It introduces a framework that evaluates gaze in three settings: gaze on its own, gaze fused with explicit semantic scene representations, and gaze fused with richer implicit visual representations. The experiments indicate that gaze carries predictive value for perception by itself and that fusing it with either type of scene representation improves model performance. This matters because it points to the value of modeling the human viewing process rather than treating perception as a direct function of image content alone.

Core claim

Gaze alone already carries useful predictive signals for subjective urban perception, and integrating gaze with scene representations further improves prediction under both semantic and richer visual representations.

What carries the argument

The Gaze-Guided Urban Perception Framework, which tests gaze-only modeling and gaze fusion with semantic and visual scene representations to predict perception labels.
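
As a rough illustration of how the three settings differ in their inputs, the sketch below builds one feature vector per setting for a single viewing trial. Everything in it is an assumption made for illustration: the helper names `concat_fusion` and `build_inputs`, the toy feature values, and the use of plain concatenation as the fusion rule. The paper's actual fusion modules may work differently.

```python
from typing import Dict, List, Sequence


def concat_fusion(gaze: Sequence[float], scene: Sequence[float]) -> List[float]:
    """Fuse two feature vectors by concatenation (an assumed, simplest-case rule)."""
    return list(gaze) + list(scene)


def build_inputs(
    gaze: Sequence[float],
    semantic: Sequence[float],
    visual: Sequence[float],
) -> Dict[str, List[float]]:
    """Return the model input for each of the three evaluation settings."""
    return {
        "gaze_only": list(gaze),
        "gaze_semantic": concat_fusion(gaze, semantic),
        "gaze_visual": concat_fusion(gaze, visual),
    }


# Hypothetical features for one participant viewing one image.
gaze = [0.31, 12.0, 0.8]          # e.g. mean fixation duration, fixation count, coverage
semantic = [0.40, 0.25, 0.35]     # e.g. pixel share of three semantic classes
visual = [0.1, -0.7, 0.3, 0.9]    # e.g. a small image embedding

inputs = build_inputs(gaze, semantic, visual)
for name, vector in inputs.items():
    print(name, len(vector))
```

Each vector would then be passed to whatever predictor is used for the perception label, so any gain from fusion can be read off by comparing the three settings under the same predictor.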

Load-bearing premise

The eye-tracking recordings accurately capture the perceptual processes that viewers use to form their subjective urban perception judgments.

What would settle it

Running the same prediction experiments on a new, independent dataset of eye-tracked street views where adding gaze information fails to improve accuracy over image-only baselines.
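
One way such a replication could be scored is a paired bootstrap over test items, which estimates how much accuracy the gaze-fused model gains over the image-only baseline and how uncertain that gain is. The sketch below is illustrative only: the function name `paired_bootstrap_gain`, the toy labels, and the resample count are assumptions, not the paper's evaluation protocol.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_gain(
    truth: Sequence[int],
    pred_image: Sequence[int],
    pred_fused: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, lower, upper) with a 95% percentile interval."""
    rng = random.Random(seed)
    n = len(truth)
    # Per-item gain: +1 if only the fused model is right, -1 if only the baseline is.
    gains: List[int] = [
        int(f == t) - int(i == t) for t, i, f in zip(truth, pred_image, pred_fused)
    ]
    observed = sum(gains) / n
    stats = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the two models paired.
        sample = [gains[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int(0.025 * n_resamples)]
    upper = stats[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Toy labels: 0 = Low, 1 = Neutral, 2 = High.
truth = [0, 1, 2, 1, 0, 2, 1, 0]
pred_image = [0, 1, 1, 1, 2, 2, 0, 0]
pred_fused = [0, 1, 2, 1, 0, 2, 0, 0]

observed, lower, upper = paired_bootstrap_gain(truth, pred_image, pred_fused)
print(f"accuracy gain = {observed:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

An interval that includes zero or lies below it on the independent dataset would be the outcome that counts against the claim.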

Figures

Figures reproduced from arXiv: 2605.00764 by Konrad Schindler, Lin Che, Marc Pollefeys, Martin Raubal, Peter Kiefer, Xi Wang.

Figure 1: Significant gaze-only features under one-way ANOVA across perception levels (Low/Neutral/High). Features with …
Figure 2: Significant AOI fixation features under one-way ANOVA across perception levels (Low/Neutral/High). The dashed line …
Figure 3: Overview of the proposed Gaze-Guided Urban Perception Framework. Raw gaze recordings are first segmented into …
Figure 4: Qualitative attribution comparison of the …
Figure 5: Distribution of Mean Pairwise Distance (MPD) between participants for the three perception dimensions. Ratings are …
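
Figures 1 and 2 rely on one-way ANOVA across the three perception levels. For readers unfamiliar with the test, the sketch below computes the F statistic for a single gaze feature grouped by level. The function name `one_way_anova_f` and the toy fixation durations are assumptions for illustration and are not taken from the dataset.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group variance divided by within-group variance."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical mean fixation durations (ms) for trials rated Low, Neutral, High.
low = [180.0, 200.0, 220.0]
neutral = [200.0, 220.0, 240.0]
high = [260.0, 280.0, 300.0]

f_stat = one_way_anova_f([low, neutral, high])
print(f"F(2, 6) = {f_stat:.2f}")
```

A large F relative to the F distribution with (k - 1, n - k) degrees of freedom indicates that the feature's mean differs across perception levels. In practice a statistics library would also supply the p-value.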
Original abstract

Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Place Pulse-Gaze dataset, which augments street-view images with synchronized eye-tracking recordings and per-participant subjective perception labels (e.g., safety, liveliness). It proposes a Gaze-Guided Urban Perception Framework that evaluates three settings—gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations—and reports that gaze alone supplies useful predictive signals while fusion yields further gains.

Significance. If the quantitative results and controls hold, the work is significant for shifting urban perception modeling from purely image-based approaches to ones that explicitly incorporate human perceptual processes via gaze. The new dataset is a concrete contribution that can support follow-on research on multimodal urban computing and human-aligned scene understanding.

major comments (2)
  1. [Experiments] Experiments section: no comparison is reported against standard bottom-up saliency models (e.g., Itti-Koch or modern deep saliency predictors) as a control. Without this, it is impossible to determine whether the reported predictive power of gaze-only and fusion models arises from signals specific to subjective perception judgments or from generic image-content correlations that any saliency map would capture.
  2. [§3 and §4] §3 (Dataset) and §4 (Framework): the description of the eye-tracking protocol and label-collection procedure does not include per-attribute alignment analysis or controls that would verify that fixation patterns are driven by the higher-level attributes being labeled rather than low-level visual features. This directly affects the validity of the central claim that gaze data models the formation of subjective judgments.
minor comments (2)
  1. [Abstract] The abstract states positive outcomes across three settings but supplies no numerical metrics, error bars, or baseline comparisons; these should be added for immediate readability.
  2. [§4] Notation for the fusion modules (semantic vs. visual) is introduced without an explicit equation or diagram showing how gaze features are combined with scene features; a small schematic would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the specificity of our gaze-based signals and strengthen the connection between gaze patterns and high-level attributes. We address each major comment below and propose targeted revisions.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: no comparison is reported against standard bottom-up saliency models (e.g., Itti-Koch or modern deep saliency predictors) as a control. Without this, it is impossible to determine whether the reported predictive power of gaze-only and fusion models arises from signals specific to subjective perception judgments or from generic image-content correlations that any saliency map would capture.

    Authors: We agree that this control is essential to isolate whether gaze data contributes signals tied to subjective judgments beyond generic visual saliency. In the revised manuscript, we will add comparisons using the Itti-Koch model and a modern deep saliency predictor (e.g., DeepGaze). Saliency maps will be extracted from the street-view images and evaluated both in isolation and fused with scene representations, directly benchmarking against our gaze-only and gaze-fusion results to demonstrate the added value of human gaze. revision: yes

  2. Referee: [§3 and §4] §3 (Dataset) and §4 (Framework): the description of the eye-tracking protocol and label-collection procedure does not include per-attribute alignment analysis or controls that would verify that fixation patterns are driven by the higher-level attributes being labeled rather than low-level visual features. This directly affects the validity of the central claim that gaze data models the formation of subjective judgments.

    Authors: We acknowledge that explicit per-attribute alignment analysis would better validate that gaze reflects high-level subjective attributes rather than low-level features. The original submission did not include such post-hoc analysis. In the revision, we will expand §3 and §4 with new analysis correlating fixation patterns (e.g., duration and spatial distribution) with individual attribute labels across participants, and we will incorporate controls for low-level features by referencing the saliency model comparisons added to the experiments. This will directly support the claim that gaze models subjective judgment formation. revision: yes
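
A simple way to quantify how closely recorded gaze follows a bottom-up saliency prediction, which is the control at issue in the first response, is the Pearson correlation between a fixation density map and a saliency map on the same grid. The sketch below is a minimal illustration under stated assumptions: the helpers `fixation_density` and `pearson_cc`, the 2x2 grid, and all values are hypothetical, and no actual saliency model is run.

```python
import math
from typing import List, Sequence, Tuple

Grid = Sequence[Sequence[float]]


def fixation_density(
    fixations: Sequence[Tuple[int, int, float]], rows: int, cols: int
) -> List[List[float]]:
    """Accumulate fixation durations into a coarse grid of (row, col) cells."""
    grid = [[0.0] * cols for _ in range(rows)]
    for row, col, duration in fixations:
        grid[row][col] += duration
    return grid


def pearson_cc(a: Grid, b: Grid) -> float:
    """Pearson correlation between two maps of identical shape."""
    xs = [v for line in a for v in line]
    ys = [v for line in b for v in line]
    if len(xs) != len(ys):
        raise ValueError("maps must have the same number of cells")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical fixations as (row, col, duration in seconds) on a 2x2 grid.
human_map = fixation_density([(0, 0, 0.4), (1, 1, 0.6)], rows=2, cols=2)
# Hypothetical output of a saliency model, resampled to the same grid.
saliency_map = [[0.1, 0.9], [0.8, 0.2]]

cc = pearson_cc(human_map, saliency_map)
print(f"CC(human gaze, saliency) = {cc:.3f}")
```

Consistently high correlations would suggest that the gaze signal is largely recoverable from image content, whereas low or negative values leave room for gaze to carry perception-specific information.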

Circularity Check

0 steps flagged

No circularity: claims rest on a new empirical dataset and experimental comparisons

Full rationale

The paper collects a new Place Pulse-Gaze dataset pairing street-view images with synchronized eye-tracking and perception labels, then evaluates three modeling settings (gaze-only, semantic fusion, visual fusion) via reported performance metrics. No derivation chain, equations, or fitted parameters are defined in terms of the target predictions; results are presented as direct outcomes of training and testing on the held-out data. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known empirical patterns are merely renamed. The central claims therefore remain externally falsifiable through the released dataset and models rather than reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical computer vision and human-computer interaction paper centered on new data collection and model evaluation. No mathematical axioms, free parameters, or invented entities are invoked in the abstract; the claims depend on experimental outcomes from the introduced dataset.

pith-pipeline@v0.9.0 · 5480 in / 1129 out tokens · 48739 ms · 2026-05-09T19:09:53.029880+00:00 · methodology

