Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3
The pith
Urban-ImageNet supplies over two million social-media images of Chinese cities organized by an urban-theory taxonomy to test AI perception of public spaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Urban-ImageNet organizes user-generated images into a ten-class taxonomy grounded in urban studies that distinguishes activated public spaces, consumption areas, accommodation, portraits, and non-spatial social-media content. The resulting benchmark evaluates representative vision, vision-language, and segmentation models on classification, retrieval, and instance-level tasks, revealing strong supervised performance on scene labels but persistent challenges in cross-modal alignment and object segmentation that narrow only modestly with larger balanced training sets.
What carries the argument
HUSIC taxonomy, a hierarchical ten-class system grounded in urban theory that separates activated versus non-activated public spaces, exterior versus interior environments, and spatial versus non-spatial content to structure evaluation across modalities and scales.
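The hierarchy HUSIC imposes can be pictured as a small tree. The sketch below is an illustration only: this review does not enumerate the ten class names, so the leaf labels here are invented placeholders arranged along the activated/non-activated, exterior/interior, and spatial/non-spatial splits described above.

```python
# Hedged sketch of a HUSIC-style hierarchy. The leaf names are invented
# placeholders, not the paper's actual ten classes; a faithful version
# would carry exactly ten leaves.
HUSIC_SKETCH = {
    "spatial": {
        "exterior": {
            "public_space": ["activated_public_space", "non_activated_public_space"],
            "other_exterior": ["street_or_landscape"],
        },
        "interior": {
            "accommodation": ["accommodation_space"],
            "consumption": ["consumption_space"],
        },
    },
    "non_spatial": {
        "people": ["portrait"],
        "media": ["non_spatial_post"],
    },
}

def leaf_classes(node):
    """Collect leaf labels from the nested dict/list structure."""
    if isinstance(node, list):
        return list(node)
    leaves = []
    for child in node.values():
        leaves.extend(leaf_classes(child))
    return leaves
```

A structure like this makes the evaluation splits mechanical: classifiers can be scored at any internal node (e.g. spatial vs. non-spatial) as well as at the leaves.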
If this is right
- Supervised models reach high accuracy on urban scene classification once trained on the provided 1K to 100K subsets.
- Cross-modal image-text retrieval remains harder than classification, showing limits in current vision-language alignment for urban content.
- Instance segmentation improves with larger training volumes but stays more challenging than whole-scene classification.
- The multi-scale design lets researchers measure exactly how much additional balanced data closes the performance gaps on each task.
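The multi-scale claim above can be made concrete with a toy measurement: fit a log-linear trend of score versus training-set size across the 1K/10K/100K tiers and invert it to estimate the data volume needed to hit a target. All numbers below are invented for illustration, not the paper's results, and the log-linear form is itself an assumption.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score ~ a + b * log10(n) across subset sizes."""
    xs = [math.log10(n) for n in sizes]
    k = len(xs)
    mx, my = sum(xs) / k, sum(scores) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def data_needed(a, b, target):
    """Invert the fit: training volume at which the trend reaches `target`."""
    return 10 ** ((target - a) / b)

# Invented scores for illustration only (not the paper's numbers).
sizes = [1_000, 10_000, 100_000]
segmentation = [0.22, 0.31, 0.38]  # hypothetical T3 mask AP per tier

a, b = fit_log_linear(sizes, segmentation)
# Extrapolated volume to reach AP 0.50, under the strong log-linear assumption.
print(round(data_needed(a, b, 0.50)))
```

Comparing the fitted slope `b` across T1, T2, and T3 is one way to state precisely which task benefits most from additional balanced data.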
Where Pith is reading between the lines
- The same taxonomy and task structure could be applied to social-media imagery from other countries or platforms to test whether the observed performance patterns hold beyond Chinese cities.
- Gaps in retrieval and segmentation suggest that future models may need explicit mechanisms for functional and social context rather than purely visual features.
- If the benchmark succeeds, planners and researchers could use it to train systems that automatically analyze public-space usage from the large volume of online photos already being shared.
Load-bearing premise
That the HUSIC taxonomy correctly identifies the spatial, social, and functional distinctions that matter most for how people experience urban spaces, and that the Weibo images represent typical city environments without major selection bias.
What would settle it
A sample of images labeled by independent urban experts shows frequent disagreement with the HUSIC classes, or models trained on the dataset achieve no better accuracy on an independent urban image collection than models trained on generic scene datasets.
Original abstract
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 Million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities across 2019-2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K, 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
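Cross-modal retrieval of the T2 kind is conventionally scored with Recall@k over an image-text similarity matrix. A minimal, dependency-free sketch with toy two-dimensional "embeddings" (nothing here reflects the paper's actual features, models, or scores):

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def recall_at_k(image_vecs, text_vecs, k=1):
    """Fraction of texts whose matching image (same index) ranks in the top k."""
    hits = 0
    for i, t in enumerate(text_vecs):
        sims = [cosine(t, im) for im in image_vecs]
        top = sorted(range(len(sims)), key=lambda j: -sims[j])[:k]
        hits += i in top
    return hits / len(text_vecs)

# Toy paired embeddings: each text vector is a noisy copy of its image vector.
images = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
texts = [(0.9, 0.1), (0.2, 0.9), (0.6, 0.8)]
print(recall_at_k(images, texts, k=1))  # → 1.0
```

Real vision-language baselines such as CLIP or BLIP would supply the embeddings; the "cross-modal alignment" gap the review describes shows up as Recall@k falling well below the classification accuracy on the same images.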
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Urban-ImageNet, a dataset of over 2 million Weibo user-generated images and paired text posts from 61 sites across 24 Chinese cities (2019-2025), organized under the HUSIC hierarchical taxonomy that distinguishes activated/non-activated public spaces, interior/exterior environments, accommodation, consumption, portraits, and non-spatial content. It defines a standardized benchmark with three tasks—T1 urban scene semantic classification, T2 cross-modal image-text retrieval, and T3 instance segmentation—evaluated on controlled subsets (1K/10K/100K) and the full 2M corpus using representative vision, vision-language, and segmentation models. Experiments report differential performance (strong on supervised classification, weaker on retrieval and segmentation) and examine scaling effects with increasing balanced training data. The work positions the resource as a theory-grounded, multi-city, multi-modal benchmark for AI perception of contemporary urban spaces, with public release via Hugging Face and GitHub.
Significance. If the HUSIC taxonomy proves reliable and the Weibo corpus representative, the contribution is a large-scale, publicly available multi-modal benchmark that bridges computer vision and urban studies by focusing on spatial, social, and functional distinctions rather than generic scenes. The multi-task formulation, controlled scaling subsets, and public dataset/code release are strengths that enable reproducible evaluation and interdisciplinary follow-up work. The reported performance gaps across tasks and scales provide initial empirical signals about model limitations in urban contexts.
major comments (3)
- [Abstract and §3] Abstract and §3 (Data Collection): The central claim that Urban-ImageNet supplies a valid benchmark for 'contemporary urban spaces' rests on the HUSIC taxonomy capturing central distinctions, yet the manuscript provides no details on the labeling process, inter-annotator agreement scores, or bias mitigation procedures for the 10-class hierarchy.
- [§3 and §5] §3 (Dataset Curation) and §5 (Experiments): The 2M Weibo corpus and its subsets are asserted to represent typical urban spaces across 24 cities, but no quantitative validation (e.g., comparison to official land-use maps, demographic controls, or multi-source cross-checks) is reported; this leaves the representativeness claim vulnerable to known social-media selection biases toward salient/positive content.
- [§5] §5 (Multi-scale Study): Performance trends are shown as training data grows from 1K to 100K images, but the results lack statistical significance testing, confidence intervals, or ablation against stronger baselines, weakening the interpretation of scaling behavior for the three tasks.
minor comments (2)
- [§2] The HUSIC taxonomy is introduced as 'grounded in urban theory,' but the manuscript would benefit from explicit citations to the specific urban studies references that motivate each of the 10 classes.
- [§5] Figure and table captions for the benchmark results could more clearly indicate which subsets (1K/10K/100K) correspond to each reported metric to improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, indicating the revisions we plan to make.
-
Referee: [Abstract and §3] Abstract and §3 (Data Collection): The central claim that Urban-ImageNet supplies a valid benchmark for 'contemporary urban spaces' rests on the HUSIC taxonomy capturing central distinctions, yet the manuscript provides no details on the labeling process, inter-annotator agreement scores, or bias mitigation procedures for the 10-class hierarchy.
Authors: We agree that additional details on the taxonomy construction are necessary to support the benchmark's validity. In the revised manuscript, we will expand §3 to include a description of the labeling process, including the involvement of domain experts in urban studies, the iterative development of the HUSIC hierarchy, inter-annotator agreement scores computed on a sample of annotations, and bias mitigation strategies such as diverse annotator backgrounds and consensus-based labeling. This will be added without altering the core claims. revision: yes
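Inter-annotator agreement of the kind promised here is conventionally reported as Cohen's κ for pairs of raters, with the Landis and Koch scale (cited in the reference list) for interpretation. A self-contained sketch on invented toy labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Expected agreement if both raters labeled independently at random
    # according to their own marginal label frequencies.
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy HUSIC-style labels from two hypothetical annotators.
a = ["activated", "activated", "consumption", "portrait", "non_spatial"]
b = ["activated", "non_activated", "consumption", "portrait", "non_spatial"]
print(round(cohens_kappa(a, b), 3))  # → 0.75
```

For more than two annotators, Fleiss' κ or Krippendorff's α would be the usual substitutes.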
-
Referee: [§3 and §5] §3 (Dataset Curation) and §5 (Experiments): The 2M Weibo corpus and its subsets are asserted to represent typical urban spaces across 24 cities, but no quantitative validation (e.g., comparison to official land-use maps, demographic controls, or multi-source cross-checks) is reported; this leaves the representativeness claim vulnerable to known social-media selection biases toward salient/positive content.
Authors: We acknowledge the limitation regarding quantitative validation of representativeness. The Weibo data inherently carries selection biases as user-generated content. In the revision, we will add a new subsection in §3 discussing these biases explicitly, including any available comparisons (e.g., city-level image distribution vs. population data), and clarify that the dataset serves as a benchmark for social media perceptions of urban spaces rather than a statistically representative sample of all urban environments. We will also include this in the limitations section. revision: partial
-
Referee: [§5] §5 (Multi-scale Study): Performance trends are shown as training data grows from 1K to 100K images, but the results lack statistical significance testing, confidence intervals, or ablation against stronger baselines, weakening the interpretation of scaling behavior for the three tasks.
Authors: We appreciate this suggestion for improving the rigor of our experimental analysis. In the updated §5, we will incorporate statistical significance testing (such as bootstrap confidence intervals and paired statistical tests) for the performance metrics across scales, report 95% confidence intervals, and perform additional ablations using stronger contemporary baselines (e.g., recent CLIP variants or segmentation models). These additions will better substantiate the scaling observations. revision: yes
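The promised bootstrap confidence intervals can be sketched as a percentile bootstrap over per-example correctness. The outcomes below are synthetic, and the 2,000-resample count is an arbitrary illustrative choice.

```python
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_boot):
        # Resample n examples with replacement and record the mean accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic per-example correctness: 80 right out of 100.
outcomes = [1] * 80 + [0] * 20
lo, hi = bootstrap_ci(outcomes)
print(lo, hi)
```

Paired comparisons between two models would bootstrap the per-example difference in correctness instead, which is what a paired significance test on the same evaluation set amounts to.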
Circularity Check
No circularity: dataset and benchmark construction is self-contained
The paper introduces a new multi-modal dataset from Weibo imagery, defines the HUSIC taxonomy from urban theory literature, and specifies three independent benchmark tasks (semantic classification, cross-modal retrieval, instance segmentation) with standard model evaluations. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. The central contribution is data curation and task formulation rather than any self-referential result; external benchmarks and model comparisons are performed on off-the-shelf architectures without load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the HUSIC taxonomy is grounded in urban theory and distinguishes key urban space types
invented entities (1)
- HUSIC (Hierarchical Urban Space Image Classification) framework: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stéphane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh Weaver. Google Street View: Capturing the world at street level. Computer, 43(6):32–38, 2010
-
[2]
Mary Jo Bitner. Servicescapes: The impact of physical surroundings on customers and employees. Journal of Marketing, 56(2):57–71, 1992. doi: 10.1177/002224299205600205
-
[3]
John D. Boy and Justus Uitermark. Reassembling the city through Instagram. Transactions of the Institute of British Geographers, 42(2):612–624, 2017. doi: 10.1111/tran.12185
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
-
[5]
Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1483–1498, 2021. doi: 10.1109/TPAMI.2019.2956516
-
[6]
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016. doi: 10.1109/CVPR.2016.350
-
[7]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848
-
[8]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021
-
[9]
Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, and César A. Hidalgo. Deep learning the city: Quantifying urban perception at a global scale. In European Conference on Computer Vision (ECCV), pages 196–212. Springer, 2016. doi: 10.1007/978-3-319-46448-0_12
-
[10]
Jan Gehl. Life Between Buildings: Using Public Space. Island Press, Washington, DC, 6th edition, 2011
-
[11]
Erving Goffman. The Presentation of Self in Everyday Life. Anchor Books, New York, 1959
-
[12]
Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5356–5364, 2019. doi: 10.1109/CVPR.2019.00550
-
[13]
Jun He, Yi Lin, Zilong Huang, Jiacong Yin, Junyan Ye, Yuchuan Zhou, Weijia Li, and Xiang Zhang. UrbanFeel: A comprehensive benchmark for temporal and perceptual understanding of city scenes through human perspective. arXiv preprint arXiv:2509.22228, 2025. URL https://arxiv.org/abs/2509.22228
-
[14]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90
-
[15]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017. doi: 10.1109/ICCV.2017.322
-
[16]
Bill Hillier and Julienne Hanson. The Social Logic of Space. Cambridge University Press, Cambridge, 1984
-
[17]
Nadav Hochman and Lev Manovich. Zooming into an Instagram city: Reading the local through social media. First Monday, 18(7), 2013. doi: 10.5210/fm.v18i7.4711
-
[18]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
-
[19]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In IEEE International Conference on Computer Vision (ICCV), pages 4015–4026, 2023. doi: 10.1109/ICCV51070.2023.00371
-
[20]
J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977. doi: 10.2307/2529310
-
[21]
Henri Lefebvre. The Production of Space. Blackwell, Oxford, 1991. Translated by D. Nicholson-Smith
-
[22]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (ICML), pages 12888–12900, 2022
-
[23]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML), 2023
-
[24]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014. doi: 10.1007/978-3-319-10602-1_48
-
[25]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning (LLaVA). In Advances in Neural Information Processing Systems (NeurIPS), 2024
-
[26]
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023
-
[27]
Kevin Lynch. The Image of the City. MIT Press, Cambridge, MA, 1960
-
[28]
Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In IEEE International Conference on Computer Vision (ICCV), pages 4990–4999, 2017. doi: 10.1109/ICCV.2017.534
-
[29]
Oscar Newman. Defensible Space: Crime Prevention through Urban Design. Macmillan, New York, 1972
-
[30]
Yiwei Ou, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, and Manfredo Manfredini. MMS-VPR: Multimodal street-level visual place recognition dataset and benchmark. arXiv preprint arXiv:2505.12254, 2025
-
[31]
Zoltán Peredy, Sijia Li, and László Vígh. Chinese city tier ranking scheme as special spatial factor of innovations diffusion.International Review, (1-2):88–99, 2024
-
[32]
Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 413–420, 2009. doi: 10.1109/CVPR.2009.5206537
-
[33]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021
-
[34]
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024
-
[35]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In...
-
[36]
Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), pages 6105–6114, 2019
-
[37]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention (DeiT). In International Conference on Machine Learning (ICML), pages 10347–10357, 2021
-
[38]
William H. Whyte. The Social Life of Small Urban Spaces. Conservation Foundation, Washington, DC, 1980
-
[39]
Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010. doi: 10.1109/CVPR.2010.5539970
-
[40]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. doi: 10.1162/tacl_a_00166
-
[41]
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018. doi: 10.1109/TPAMI.2017.2723009
-
[42]
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127(3):302–321, 2019. doi: 10.1007/s11263-018-1140-0
Limitations noted in the paper
- Class-agnostic T3 evaluation: instance segmentation is benchmarked under a single object category; a per-class breakdown would require higher-quality human-annotated ground truth than the current pseudo-labels support.
- Geographic restriction to Chinese cities: all 61 venues are located across 24 Chinese cities; whether the HUSIC taxonomy and learned representations generalise to other cultural or urban contexts requires future geographic expansion.
- Class imbalance in the 2M corpus: the full 2M corpus is class-imbalanced by construction, reflecting real-world social-media frequency distributions (non-spatial classes each comprise ≈15–25% of posts; all spatially relevant classes collectively ≈40%); researchers requiring balanced training at scale should use the 100K tier.
- Incomplete LLaVA-1.5 100K training: 100K fine-tuning of LLaVA-1.5 was not completed due to computational constraints; 1K and 10K results are reported but 100K results are unavailable.
- T3 SAM oracle circularity: the GT-box SAM oracle (AP = 0.749) partially reflects circularity, as evaluation pseudo-labels were generated by SAM and the oracle uses SAM with perfect box prompts; the Cascade Mask R-CNN and SAM box-refinement results, trained on noisy pseudo-labels and evaluated against stricter-threshold human-audited annotations, provide the ...
- Chinese-language social-media text: post-text retrieval operates on original Chinese Weibo posts; current baselines (CLIP, BLIP, BLIP-2) were pre-trained predominantly on English data, which partly explains the low absolute post-level retrieval scores and motivates future bilingual or multilingual urban-domain pre-training.