arxiv: 2505.11809 · v3 · pith:IJFKWILUnew · submitted 2025-05-17 · 💻 cs.CV

From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models

Pith reviewed 2026-05-22 14:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords urban visibility analysis · street view imagery · vision-language models · landmark detection · visibility graph · urban planning · visual connectivity

The pith

Vision-language models applied to street view images detect urban landmarks at 87 percent accuracy and map their visual connections in a graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn the widespread availability of street view photos into a practical way to measure which real-world viewpoints can see specific city landmarks. Instead of building detailed 3D models for line-of-sight calculations, the approach feeds a reference image of a landmark into a vision-language model and checks for detections in direction-controlled street views. Successful detections are treated as evidence of visibility and are assembled into a graph that links landmarks, viewpoints, and the urban spaces that connect them. In tests on six well-known structures the method reached 87 percent overall detection accuracy and 68 percent precision on visible locations, while a London case study found bridges mediating roughly 31 percent of multi-landmark visual links. The result offers a workable alternative for cities that lack high-quality 3D data yet still need visibility information for planning or conservation.

Core claim

Reformulating landmark visibility as an image-based detection task lets a vision-language model scan street view imagery to identify visible locations, which are then assembled into a heterogeneous visibility graph that records where landmarks are seen, how strongly they connect, and which urban spaces mediate those connections.

What carries the argument

The heterogeneous visibility graph built from vision-language model detections in street view images, which links landmarks, viewpoints, and mediating urban spaces to represent visual connectivity.
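
As a rough illustration of the structure described above, the sketch below assembles a small heterogeneous graph from detection records and counts how many viewpoints see each landmark. The node-type names, edge-type names, record fields, and sample identifiers are all assumptions made for this example; the paper's actual schema is not given on this page.

```python
from collections import defaultdict

# Hypothetical detection records: (viewpoint_id, space_id, landmark_id).
# A record exists only when the VLM reported the landmark at that viewpoint.
detections = [
    ("svi_001", "bridge_A", "landmark_X"),
    ("svi_001", "bridge_A", "landmark_Y"),
    ("svi_002", "street_B", "landmark_X"),
]

def build_visibility_graph(records):
    """Return (node_types, edges) for a small heterogeneous graph."""
    node_types = {}            # node id -> "viewpoint" | "space" | "landmark"
    edges = defaultdict(set)   # edge type -> set of (source, target) pairs
    for viewpoint, space, landmark in records:
        node_types[viewpoint] = "viewpoint"
        node_types[space] = "space"
        node_types[landmark] = "landmark"
        edges["sees"].add((viewpoint, landmark))      # detection-based visibility
        edges["located_in"].add((viewpoint, space))   # viewpoint lies in a space
    return node_types, dict(edges)

def viewpoints_per_landmark(edges):
    """Count distinct viewpoints with a 'sees' edge to each landmark."""
    counts = defaultdict(int)
    for _viewpoint, landmark in edges.get("sees", ()):
        counts[landmark] += 1
    return dict(counts)

node_types, edges = build_visibility_graph(detections)
visibility_counts = viewpoints_per_landmark(edges)
```

Keeping node types in a separate mapping is one simple way to let landmarks, viewpoints, and spaces share a single edge store; a graph library would serve equally well.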

If this is right

  • Visibility assessment becomes feasible in cities without accurate 3D models because street view imagery is already widely available.
  • The resulting graph quantifies both single-landmark visibility and joint visual connections through shared corridors.
  • Locations such as bridges can be ranked by how many multi-landmark visual links they mediate, as shown by the 31 percent figure along the Thames.
  • The same detection-plus-graph pipeline can be repeated for heritage or planning studies that need to know which viewpoints preserve sightlines to multiple landmarks.
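
A figure such as the bridge share can be read as a simple proportion over mediated connections. The sketch below shows that arithmetic on made-up data; the space-type labels and the sample list are hypothetical, and how the paper defines and counts a "connection" is not specified on this page.

```python
from collections import Counter

# Hypothetical input: one entry per multi-landmark connection, giving the type
# of urban space that mediates it. Illustrative values, not the paper's data.
mediating_space_types = [
    "bridge", "street", "bridge", "park", "street",
    "street", "bridge", "embankment", "street", "park",
]

def share_by_space_type(space_types):
    """Fraction of connections mediated by each type of space."""
    counts = Counter(space_types)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

shares = share_by_space_type(mediating_space_types)
# Rank space types from most to least frequent mediator.
ranking = sorted(shares, key=shares.get, reverse=True)
```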

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graph structure could be combined with pedestrian movement data to suggest routes that maximize or minimize exposure to certain landmarks.
  • Extending the method to non-landmark objects such as public art or signage would create broader visual-network maps of everyday streetscapes.
  • Because the approach works on existing imagery, it could support repeated mapping over time to track how new buildings or vegetation alter visual connections.
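
The repeated-mapping idea in the last bullet amounts to comparing two edge sets. A minimal sketch, assuming each survey yields a set of (viewpoint, landmark) pairs; the identifiers and the two sample surveys are hypothetical.

```python
def visibility_change(edges_before, edges_after):
    """Split (viewpoint, landmark) edges into lost, gained, and kept sets."""
    before, after = set(edges_before), set(edges_after)
    return {
        "lost": before - after,      # sightlines present earlier, absent later
        "gained": after - before,    # sightlines that appear only later
        "kept": before & after,      # sightlines found in both surveys
    }

# Hypothetical edges from an earlier and a later imagery capture.
edges_earlier = {("svi_001", "landmark_X"), ("svi_002", "landmark_X"),
                 ("svi_003", "landmark_Y")}
edges_later = {("svi_001", "landmark_X"), ("svi_003", "landmark_Y"),
               ("svi_004", "landmark_Y")}

change = visibility_change(edges_earlier, edges_later)
```

A lost edge here only means the model stopped detecting the landmark; whether a new building, vegetation, or a change in imagery caused it would need separate checking.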

Load-bearing premise

Detection of a landmark by the vision-language model in a street view image means the landmark is actually visible from that real-world viewpoint.

What would settle it

A side-by-side comparison in which human observers walk the same street-view locations and report whether they can see the target landmark, checking for cases where the model detects the landmark but people cannot.
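
The comparison proposed above could be scored as follows. This is a sketch under stated assumptions: both inputs are parallel lists of booleans, one entry per location, and the sample values are invented for illustration rather than taken from the paper.

```python
def compare_with_observers(model_detected, human_visible):
    """Score model detections against on-site human reports, per location."""
    if len(model_detected) != len(human_visible):
        raise ValueError("both lists must cover the same locations")
    tp = fp = fn = tn = 0
    false_positive_sites = []   # model says visible, observers say not
    for index, (model, human) in enumerate(zip(model_detected, human_visible)):
        if model and human:
            tp += 1
        elif model:
            fp += 1
            false_positive_sites.append(index)
        elif human:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "false_positive_sites": false_positive_sites,
    }

# Hypothetical results for ten locations.
model_detected = [True, True, True, False, False, False, False, False, True, False]
human_visible = [True, True, False, False, False, False, False, False, False, True]

report = compare_with_observers(model_detected, human_visible)
```

The list of false-positive sites is the part that bears on the load-bearing premise, since those are the locations where a detection did not correspond to what people could see.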

Figures

Figures reproduced from arXiv: 2505.11809 by Fan Zhang, Filip Biljecki, Kunihiko Fujiwara, Pengyuan Liu, Zicheng Fan.

Figure 1. A research framework for the study. Imagery: Google Street View, Wikimedia Com…
Figure 2. Workflow of locating and detecting the visibility of a distant landmark from SVI. Im…
Figure 3. Visualisation of the landmark detection process. (a) Locating landmarks using bounding…
Figure 4. Left: Different types included in the graph definition. Right: An SVI taken near the London Bridge, showing how SVI can represent multiple edge relations via a single image. Imagery: Google Street View.
Figure 5. Left: Query images of different landmarks investigated in the second case study. Imagery: Wikimedia Commons. Right: The spatial distribution of the selected landmarks along the River Thames. Basemap: © OpenStreetMap contributors.
Figure 6. A comparison of distribution for SVI-based landmark visibility, 3D simulated landmark…
Figure 7. Plots comparing the performance and foreground visual element di…
Figure 8. Plots illustrating the distribution difference of landmark-visible SVI locations and Flickr images and the related socio-economic context.
Figure 9. Inter-visibility between modern and historical landmarks along the River Thames, based…
Figure 10. Forms and frequency that landmarks are included in visual co-existence relationship.
Figure 11. VAV paths searched with random walk analysis.
Original abstract

Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.
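
The abstract mentions direction-controlled imagery without saying, on this page, how the direction is chosen. One plausible approach is to point the camera along the compass bearing from the viewpoint to the landmark. The sketch below uses the standard initial great-circle bearing formula; the function names, the field-of-view check, and the coordinates are assumptions for illustration, not the authors' procedure.

```python
import math

def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees
    clockwise from north, in the range [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360.0

def within_field_of_view(camera_heading, bearing, fov_degrees):
    """True if the bearing lies inside a horizontal field of view centred on
    the camera heading. Handles wrap-around at 0/360 degrees."""
    offset = abs((bearing - camera_heading + 180.0) % 360.0 - 180.0)
    return offset <= fov_degrees / 2.0

# Hypothetical viewpoint and landmark coordinates (degrees).
viewpoint = (0.0, 0.0)
landmark = (0.0, 1.0)
heading = initial_bearing(*viewpoint, *landmark)
```

A narrower field of view acts as a crude zoom control in this picture: it keeps the landmark centred while cropping out more of the surrounding scene.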

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reformulates urban landmark visibility assessment as an image-based detection task using pre-trained vision-language models on street view imagery (SVI). Successful VLM detections on direction- and zoom-controlled frames are taken to indicate machine-recognized visibility at the corresponding real-world viewpoint. The method is evaluated on six well-known landmarks across global cities, reporting 87% overall detection accuracy and 68% precision for visible locations. It further constructs a heterogeneous visibility graph to map visual connections among landmarks, viewpoints, and mediating urban spaces, with a London River Thames case study finding that bridges account for ~31% of connections. The approach is positioned as a complement to line-of-sight (LoS) geometric simulations, especially where accurate 3D data is unavailable.

Significance. If the core mapping from VLM detection to actual human-perceived visibility can be substantiated, the work offers a scalable, low-data alternative for visibility analysis in urban planning and heritage conservation. The visibility graph component enables network-level insights into multi-landmark visual corridors and mediating locations that geometric LoS methods do not directly provide. A clear strength is the reliance on existing pre-trained VLMs and publicly available SVI without introducing new fitted parameters or custom training, supporting reproducibility.

major comments (2)
  1. [Abstract] The reported overall detection accuracy of 87% and precision of 68% for landmark-visible locations are presented without any information on ground-truth collection protocol, validation dataset size, inter-rater agreement, or handling of false positives/negatives. This detail is load-bearing for the central claim, because the method's utility rests on the assumption that VLM success on an SVI frame reliably indicates real-world visual accessibility rather than model priors, contextual cues, or recognition of partially occluded or low-resolution appearances.
  2. [Case study / visibility graph section] The description of the visibility graph construction (nodes for landmarks, SVI locations, and mediating spaces; edges for detected visual connections) does not specify how edge weights are computed from detection scores or how the graph distinguishes direct versus mediated visibility. Without this, the quantitative claim that bridges account for 31% of connections in the Thames case study cannot be fully evaluated for robustness against alternative graph definitions.
minor comments (2)
  1. [Abstract] The abstract introduces the 'heterogeneous visibility graph' without a concise definition of its node and edge types or a reference to the graph-construction procedure used later in the manuscript.
  2. [Abstract] Notation for 'landmark-visible locations' versus 'detection success' could be clarified to avoid conflating the two concepts in the results summary.
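
The inter-rater agreement requested in major comment 1 is commonly reported as Cohen's kappa. The sketch below computes it for two annotators labelling the same frames; the label lists are invented for illustration, and nothing on this page says which agreement statistic, if any, the authors used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations: 1 = landmark visible in the frame, 0 = not visible.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters agree on 8 of 10 frames, but kappa is lower than 0.8 because part of that agreement is expected by chance when most frames are labelled not visible.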

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We respond to each major comment in turn, indicating where we will make revisions to improve the clarity and completeness of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The reported overall detection accuracy of 87% and precision of 68% for landmark-visible locations are presented without any information on ground-truth collection protocol, validation dataset size, inter-rater agreement, or handling of false positives/negatives. This detail is load-bearing for the central claim, because the method's utility rests on the assumption that VLM success on an SVI frame reliably indicates real-world visual accessibility rather than model priors, contextual cues, or recognition of partially occluded or low-resolution appearances.

    Authors: We agree that providing more information on the validation process in the abstract would help substantiate the reported metrics. The manuscript's methods section details the ground-truth collection through manual annotation of SVI frames and the use of multiple views to handle potential false positives from occlusions or low resolution. To directly address this, we will revise the abstract to include a short reference to the validation approach, such as noting that the accuracy is based on a manually annotated test set. We will also ensure the methods section explicitly discusses inter-rater agreement and false positive mitigation strategies. This revision will strengthen the presentation of the central claim.

    revision: yes

  2. Referee: [Case study / visibility graph section] The description of the visibility graph construction (nodes for landmarks, SVI locations, and mediating spaces; edges for detected visual connections) does not specify how edge weights are computed from detection scores or how the graph distinguishes direct versus mediated visibility. Without this, the quantitative claim that bridges account for 31% of connections in the Thames case study cannot be fully evaluated for robustness against alternative graph definitions.

    Authors: We thank the referee for highlighting the need for more precise technical details on the graph. The manuscript describes the heterogeneous graph with nodes for landmarks, viewpoints, and mediating spaces, with edges based on successful VLM detections. However, we acknowledge that the exact computation of edge weights from detection scores and the formal distinction between direct and mediated visibility could be more explicitly stated. We will revise the relevant section to include the formula for edge weights (e.g., averaged detection confidence) and clarify that direct visibility is represented by direct edges while mediated visibility involves paths through mediating nodes. This will enable better evaluation of the 31% bridges finding under different graph specifications. We will also consider adding a supplementary figure to illustrate the graph structure.

    revision: yes
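
To make the second response concrete, the sketch below takes the "averaged detection confidence" example literally and shows one way direct and mediated links could be told apart. The record layout, identifiers, and confidence values are hypothetical, and this is not the formula the manuscript will contain.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical detections: (space_id, landmark_id, confidence in [0, 1]).
detections = [
    ("bridge_A", "landmark_X", 0.9),
    ("bridge_A", "landmark_X", 0.7),
    ("bridge_A", "landmark_Y", 0.6),
    ("street_B", "landmark_X", 0.8),
]

def edge_weights(records):
    """Direct edges (space, landmark), weighted by mean detection confidence."""
    grouped = defaultdict(list)
    for space, landmark, confidence in records:
        grouped[(space, landmark)].append(confidence)
    return {edge: mean(values) for edge, values in grouped.items()}

def mediated_pairs(weights):
    """Landmark pairs linked by a two-step path through a shared space.
    Returns {(landmark_a, landmark_b): [mediating spaces]}."""
    landmarks_by_space = defaultdict(set)
    for space, landmark in weights:
        landmarks_by_space[space].add(landmark)
    pairs = defaultdict(list)
    for space, landmarks in sorted(landmarks_by_space.items()):
        ordered = sorted(landmarks)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)].append(space)
    return dict(pairs)

weights = edge_weights(detections)
mediated = mediated_pairs(weights)
```

Under a different weighting, for example detection counts instead of mean confidence, the same records could rank mediating spaces differently, which is the robustness concern the referee raises.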

Circularity Check

0 steps flagged

Empirical VLM application to landmark visibility shows no derivation circularity

Full rationale

The paper applies existing pre-trained vision-language models to direction- and zoom-controlled street view imagery to detect landmarks, reporting empirical outcomes such as 87% overall detection accuracy and 68% precision across six global landmarks. No equations, fitted parameters, or derivations are introduced that reduce these results to quantities defined or fitted within the paper itself. The visibility graph is constructed directly from successful detections as a representational tool for visual connectivity, without self-referential loops or load-bearing self-citations that would force the central claims. The method is presented as a practical complement to LoS simulations in data-constrained settings, relying on off-the-shelf VLM inference rather than any internal ansatz or uniqueness theorem. This is a standard empirical application whose results are checked against external image benchmarks rather than against quantities defined inside the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that VLM detection success equates to meaningful visual visibility in real streetscapes; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: VLM detection of a landmark in a controlled street view image indicates real-world visual visibility from that viewpoint
    This premise is required to interpret the 87% accuracy as evidence of visibility rather than pure image classification performance.

pith-pipeline@v0.9.0 · 5829 in / 1234 out tokens · 71854 ms · 2026-05-22T14:56:01.962454+00:00 · methodology


