From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models
Pith reviewed 2026-05-22 14:56 UTC · model grok-4.3
The pith
Vision-language models applied to street view images detect urban landmarks at 87 percent accuracy and map their visual connections in a graph.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reformulating landmark visibility as an image-based detection task lets a vision-language model scan street view imagery to identify the locations from which a landmark is visible. These locations are then assembled into a heterogeneous visibility graph that records where landmarks are seen, how strongly they connect, and which urban spaces mediate those connections.
What carries the argument
The heterogeneous visibility graph built from vision-language model detections in street view images, which links landmarks, viewpoints, and mediating urban spaces to represent visual connectivity.
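A minimal sketch of such a graph may help make the three node types concrete. It is not the paper's implementation: the node-type names ("landmark", "viewpoint", "space"), the class `VisibilityGraph`, and the toy node ids are all hypothetical, and edges are left unweighted for simplicity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VisibilityGraph:
    """Toy heterogeneous graph; the node-type vocabulary is an assumption."""
    node_types: dict = field(default_factory=dict)
    adjacency: dict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, a, b):
        # Undirected edge: a detection (viewpoint-landmark) or a
        # containment relation (viewpoint-space).
        self.adjacency[a].add(b)
        self.adjacency[b].add(a)

    def neighbours(self, node_id, node_type):
        return {n for n in self.adjacency[node_id]
                if self.node_types[n] == node_type}

    def viewpoints_seeing(self, landmark):
        return self.neighbours(landmark, "viewpoint")

    def co_visible_landmarks(self, landmark):
        # Landmarks that share at least one viewpoint with `landmark`.
        shared = set()
        for viewpoint in self.viewpoints_seeing(landmark):
            shared |= self.neighbours(viewpoint, "landmark")
        return shared - {landmark}


# Hypothetical toy data, not taken from the paper.
g = VisibilityGraph()
for node_id, node_type in [
    ("landmark_A", "landmark"), ("landmark_B", "landmark"),
    ("vp1", "viewpoint"), ("vp2", "viewpoint"),
    ("bridge_x", "space"),
]:
    g.add_node(node_id, node_type)
g.add_edge("vp1", "landmark_A")  # landmark_A detected from vp1
g.add_edge("vp1", "landmark_B")  # landmark_B detected from vp1
g.add_edge("vp2", "landmark_A")  # landmark_A detected from vp2
g.add_edge("vp1", "bridge_x")    # vp1 lies on bridge_x
```

Here `g.co_visible_landmarks("landmark_A")` walks landmark, viewpoint, landmark paths, which is one simple reading of a joint visual connection.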
If this is right
- Visibility assessment becomes feasible in cities without accurate 3D models because street view imagery is already widely available.
- The resulting graph quantifies both single-landmark visibility and joint visual connections through shared corridors.
- Locations such as bridges can be ranked by how many visual links they mediate; in the Thames case study, bridges account for approximately 31 percent of all connections.
- The same detection-plus-graph pipeline can be repeated for heritage or planning studies that need to know which viewpoints preserve sightlines to multiple landmarks.
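The ranking of mediating locations mentioned above can be sketched as follows. The counting rule is an assumption made for illustration, since the abstract does not define a "connection": here a connection is a pair of landmarks detected from the same viewpoint, and it is attributed to the category of space containing that viewpoint. The function name `mediation_shares` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mediation_shares(detections, space_of):
    """Share of landmark-pair connections attributed to each space category.

    detections: viewpoint id -> set of landmark ids detected there
    space_of:   viewpoint id -> category of the space containing it
    """
    counts = Counter()
    for viewpoint, landmarks in detections.items():
        # Every unordered pair of co-detected landmarks counts once.
        n_pairs = sum(1 for _ in combinations(sorted(landmarks), 2))
        counts[space_of[viewpoint]] += n_pairs
    total = sum(counts.values())
    if total == 0:
        return {}
    return {space: n / total for space, n in counts.items()}


# Hypothetical toy data, not taken from the paper.
detections = {
    "vp1": {"A", "B", "C"},  # three pairs
    "vp2": {"A", "B"},       # one pair
    "vp3": {"A"},            # no pair
}
space_of = {"vp1": "bridge", "vp2": "embankment", "vp3": "street"}
shares = mediation_shares(detections, space_of)
```

Under a different definition of a connection (for example, weighting by detection confidence), the shares would change, which is why the counting rule matters when interpreting a figure such as the 31 percent reported for bridges.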
Where Pith is reading between the lines
- The graph structure could be combined with pedestrian movement data to suggest routes that maximize or minimize exposure to certain landmarks.
- Extending the method to non-landmark objects such as public art or signage would create broader visual-network maps of everyday streetscapes.
- Because the approach works on existing imagery, it could support repeated mapping over time to track how new buildings or vegetation alter visual connections.
Load-bearing premise
Detection of a landmark by the vision-language model in a street view image means the landmark is actually visible from that real-world viewpoint.
What would settle it
A side-by-side comparison in which human observers walk the same street-view locations and report whether they can see the target landmark, checking for cases where the model detects the landmark but people cannot.
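Such a comparison reduces to a confusion table between model detections and human reports. The sketch below uses made-up labels purely to show the bookkeeping; the function names are hypothetical and the numbers are not the paper's. It also illustrates one way accuracy can exceed precision: when visible locations are rare, true negatives dominate the accuracy score.

```python
def confusion_counts(model, human):
    """Count agreement between model detections and human reports.

    model, human: equal-length sequences of booleans, one per location.
    """
    tp = sum(1 for m, h in zip(model, human) if m and h)
    fp = sum(1 for m, h in zip(model, human) if m and not h)
    fn = sum(1 for m, h in zip(model, human) if not m and h)
    tn = sum(1 for m, h in zip(model, human) if not m and not h)
    return tp, fp, fn, tn


def accuracy_and_precision(model, human):
    tp, fp, fn, tn = confusion_counts(model, human)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision is undefined when the model never reports a detection.
    precision = tp / (tp + fp) if (tp + fp) else None
    return accuracy, precision


# Hypothetical labels for ten locations, not taken from the paper.
human = [True, True] + [False] * 8
model = [True, True, True] + [False] * 7
accuracy, precision = accuracy_and_precision(model, human)
```

The false positives (model detects, people cannot see) are the cases that would most directly challenge the load-bearing premise.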
Original abstract
Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.
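The abstract does not say how the street view direction is controlled. One plausible reading, stated here as an assumption, is that each panorama is pointed along the compass bearing from the viewpoint to the target landmark. A minimal sketch of that bearing calculation, using the standard initial-bearing formula on a sphere (the function name is hypothetical):

```python
import math


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Compass bearing in degrees (0 = north, 90 = east) from point 1 to point 2.

    Inputs are latitude and longitude in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    east = math.sin(dlon) * math.cos(phi2)
    north = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(east, north)) % 360.0
```

For example, `initial_bearing_deg(0.0, 0.0, 0.0, 1.0)` points due east. A heading computed this way could be passed to whichever street view service is used; the request parameters of any particular service are not covered here.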
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reformulates urban landmark visibility assessment as an image-based detection task using pre-trained vision-language models on street view imagery (SVI). Successful VLM detections on direction- and zoom-controlled frames are taken to indicate machine-recognized visibility at the corresponding real-world viewpoint. The method is evaluated on six well-known landmarks across global cities, reporting 87% overall detection accuracy and 68% precision for visible locations. It further constructs a heterogeneous visibility graph to map visual connections among landmarks, viewpoints, and mediating urban spaces, with a London River Thames case study finding that bridges account for ~31% of connections. The approach is positioned as a complement to line-of-sight (LoS) geometric simulations, especially where accurate 3D data is unavailable.
Significance. If the core mapping from VLM detection to actual human-perceived visibility can be substantiated, the work offers a scalable, low-data alternative for visibility analysis in urban planning and heritage conservation. The visibility graph component enables network-level insights into multi-landmark visual corridors and mediating locations that geometric LoS methods do not directly provide. A clear strength is the reliance on existing pre-trained VLMs and publicly available SVI without introducing new fitted parameters or custom training, supporting reproducibility.
major comments (2)
- [Abstract] The reported overall detection accuracy of 87% and precision of 68% for landmark-visible locations are presented without any information on the ground-truth collection protocol, validation dataset size, inter-rater agreement, or handling of false positives/negatives. This detail is load-bearing for the central claim, because the method's utility rests on the assumption that VLM success on an SVI frame reliably indicates real-world visual accessibility rather than model priors, contextual cues, or recognition of partially occluded or low-resolution appearances.
- [Case study / visibility graph section] The description of the visibility graph construction (nodes for landmarks, SVI locations, and mediating spaces; edges for detected visual connections) does not specify how edge weights are computed from detection scores or how the graph distinguishes direct versus mediated visibility. Without this, the quantitative claim that bridges account for 31% of connections in the Thames case study cannot be fully evaluated for robustness against alternative graph definitions.
minor comments (2)
- [Abstract] The abstract introduces the 'heterogeneous visibility graph' without a concise definition of its node and edge types or a reference to the graph-construction procedure used later in the manuscript.
- [Abstract] Notation for 'landmark-visible locations' versus 'detection success' could be clarified to avoid conflating the two concepts in the results summary.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We respond to each major comment in turn, indicating where we will make revisions to improve the clarity and completeness of the manuscript.
- Referee: [Abstract] The reported overall detection accuracy of 87% and precision of 68% for landmark-visible locations are presented without any information on the ground-truth collection protocol, validation dataset size, inter-rater agreement, or handling of false positives/negatives. This detail is load-bearing for the central claim, because the method's utility rests on the assumption that VLM success on an SVI frame reliably indicates real-world visual accessibility rather than model priors, contextual cues, or recognition of partially occluded or low-resolution appearances.
  Authors: We agree that providing more information on the validation process in the abstract would help substantiate the reported metrics. The manuscript's methods section details the ground-truth collection through manual annotation of SVI frames and the use of multiple views to handle potential false positives from occlusions or low resolution. To directly address this, we will revise the abstract to include a short reference to the validation approach, such as noting that the accuracy is based on a manually annotated test set. We will also ensure the methods section explicitly discusses inter-rater agreement and false positive mitigation strategies. This revision will strengthen the presentation of the central claim.
  Revision: yes
- Referee: [Case study / visibility graph section] The description of the visibility graph construction (nodes for landmarks, SVI locations, and mediating spaces; edges for detected visual connections) does not specify how edge weights are computed from detection scores or how the graph distinguishes direct versus mediated visibility. Without this, the quantitative claim that bridges account for 31% of connections in the Thames case study cannot be fully evaluated for robustness against alternative graph definitions.
  Authors: We thank the referee for highlighting the need for more precise technical details on the graph. The manuscript describes the heterogeneous graph with nodes for landmarks, viewpoints, and mediating spaces, with edges based on successful VLM detections. However, we acknowledge that the exact computation of edge weights from detection scores and the formal distinction between direct and mediated visibility could be more explicitly stated. We will revise the relevant section to include the formula for edge weights (e.g., averaged detection confidence) and clarify that direct visibility is represented by direct edges while mediated visibility involves paths through mediating nodes. This will enable better evaluation of the 31% bridges finding under different graph specifications. We will also consider adding a supplementary figure to illustrate the graph structure.
  Revision: yes
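The edge-weight rule mentioned in the response is given only as an example ("averaged detection confidence"), so the following is a sketch under that assumption rather than the authors' formula. It groups repeated detections of the same landmark from the same viewpoint and reduces their confidence scores with a pluggable aggregator, which makes it easy to test how sensitive downstream results are to the choice. The function name and toy scores are hypothetical.

```python
from statistics import mean


def edge_weights(detections, aggregate=mean):
    """Weight each viewpoint-landmark edge from its detection confidences.

    detections: iterable of (viewpoint, landmark, confidence) triples,
                possibly several per pair (e.g. different zoom levels).
    aggregate:  function reducing a list of confidences to one number.
    """
    grouped = {}
    for viewpoint, landmark, confidence in detections:
        grouped.setdefault((viewpoint, landmark), []).append(confidence)
    return {edge: aggregate(scores) for edge, scores in grouped.items()}


# Hypothetical confidence scores, not taken from the paper.
detections = [
    ("vp1", "A", 0.5),
    ("vp1", "A", 1.0),
    ("vp1", "B", 0.25),
]
mean_weights = edge_weights(detections)
max_weights = edge_weights(detections, aggregate=max)
```

Swapping `mean` for `max` (or another rule) and recomputing a summary such as the bridge share is one simple form of the robustness check the referee asks for.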
Circularity Check
Empirical VLM application to landmark visibility shows no derivation circularity
The paper applies existing pre-trained vision-language models to direction- and zoom-controlled street view imagery to detect landmarks, reporting empirical outcomes such as 87% overall detection accuracy and 68% precision across six global landmarks. No equations, fitted parameters, or derivations are introduced that reduce these results to quantities defined or fitted within the paper itself. The visibility graph is constructed directly from successful detections as a representational tool for visual connectivity, without self-referential loops or load-bearing self-citations that would force the central claims. The method is presented as a practical complement to LoS simulations in data-constrained settings, relying on off-the-shelf VLM inference rather than any internal ansatz or uniqueness theorem. The work is therefore a standard empirical application whose results are checked against external image benchmarks rather than against quantities defined within the paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VLM detection of a landmark in a controlled street view image indicates that the landmark is actually visible from that real-world viewpoint.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_injective (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "heterogeneous visibility graph ... Inter-visibility, Visual Co-existence, Visible–Accessible–Visible (VAV) Path"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.