arxiv: 2604.22103 · v1 · submitted 2026-04-23 · 💻 cs.CY · cs.CV

How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits

Pith reviewed 2026-05-08 13:33 UTC · model grok-4.3

classification 💻 cs.CY cs.CV
keywords urban perception · counterfactual explanation · street-view images · image editing · safety perception · visual levers · interventional methods · explainable AI

The pith

A lever-based framework uses prompt-conditioned image edits to identify which localized visual changes in street scenes shift perceived safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to explain urban perception models by testing specific visual changes instead of relying on correlations alone. It defines a lever as a combination of a semantic concept, a spatial location, a change direction, and an edit template, then generates candidate edits through prompt-conditioned image editing and filters them for realism and scene consistency. In a pilot with 50 scenes from five cities, edits to mobility infrastructure and physical maintenance produced the largest shifts in the models' proxy safety scores. This approach matters because it points toward actionable visual modifications that could influence how people judge city environments. The authors position human pairwise judgments as the required next validation step beyond the current proxy results.

Core claim

The paper claims that recasting scene explainability as a bounded search over structured counterfactual edits allows identification of localized visual changes that plausibly alter perceptions. Each edit is defined by a semantic concept, spatial support, intervention direction, and constrained template. Pilot results under validated prompt-only editing show directional patterns in which mobility infrastructure and physical maintenance edits yield the largest auxiliary safety shifts.

What carries the argument

The lever-based interventional counterfactual framework, which structures perception shifts as searches over semantic-spatial-intervention templates with post-edit validity checks for preservation, locality, realism, and plausibility.
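
As a rough illustration of how such a bounded search space might be laid out, the sketch below represents a lever as a small record and enumerates candidates as a Cartesian product of vocabularies. The `Lever` class, the `TEMPLATES` dictionary, the string-valued spatial support, and the example vocabulary are all assumptions made for illustration; the paper's actual intervention vocabulary and templates are not reproduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Lever:
    """One candidate intervention (hypothetical structure, not the paper's code)."""
    concept: str    # semantic concept, e.g. "bike lane"
    support: str    # spatial support; a region label here, a mask in practice
    direction: str  # intervention direction, e.g. "add" or "remove"
    template: str   # constrained edit template with named placeholders

    def prompt(self) -> str:
        # Fill the template to obtain the text prompt for the image editor.
        return self.template.format(concept=self.concept, support=self.support)


# Assumed templates, one per intervention direction.
TEMPLATES = {
    "add": "add a {concept} in the {support}",
    "remove": "remove the {concept} from the {support}",
}


def enumerate_levers(concepts, supports, directions):
    """The search is bounded because it is a finite product of small vocabularies."""
    return [
        Lever(c, s, d, TEMPLATES[d])
        for c, s, d in product(concepts, supports, directions)
    ]


levers = enumerate_levers(["bike lane", "litter"], ["foreground"], ["add", "remove"])
for lever in levers:
    print(lever.prompt())
```

Keeping the template tied to the direction means the size of the search is the product of the vocabulary sizes, which is what makes an exhaustive per-scene sweep feasible in a sketch like this.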

If this is right

  • Edits targeting mobility infrastructure and physical maintenance produce the largest shifts in safety proxies across the tested scenes.
  • Prompt-only editing yields a practical taxonomy of failure modes that can inform refinements to the generation process.
  • Proxy-based directional patterns emerge for which levers affect urban perception attributes in the pilot data.
  • The framework scales to multiple scenes and cities while maintaining bounded search over constrained edit templates.
  • Future human pairwise judgments serve as the required ground-truth endpoint to confirm or adjust the proxy findings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • City planners could simulate targeted fixes such as adding protected bike lanes before implementation to estimate perception impacts.
  • The lever structure might extend to other perception attributes like walkability or vibrancy to reveal attribute-specific visual drivers.
  • Early integration of small human rating sets could calibrate the proxy models and reduce reliance on post-hoc validation.
  • Combining the method with field experiments in actual neighborhoods would test whether simulated edits predict real-world judgment changes.

Load-bearing premise

That prompt-conditioned image editing can reliably generate edits passing checks for same-place preservation, locality, realism, and plausibility, and that resulting shifts in proxy models correspond to actual changes in human perception judgments.
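
The first half of this premise amounts to a retention gate: an edit survives only if all four checks pass. A minimal sketch of such a gate follows, assuming each check is a boolean function over precomputed scores. The score names (`place_sim`, `outside_change`, `realism`, `plausible`) and the thresholds are hypothetical; the paper does not state how the checks are implemented.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[dict], bool]

# Hypothetical checks over precomputed per-edit scores; thresholds are placeholders.
CHECKS: Dict[str, Check] = {
    "same_place": lambda e: e["place_sim"] >= 0.8,
    "locality": lambda e: e["outside_change"] <= 0.1,
    "realism": lambda e: e["realism"] >= 0.5,
    "plausibility": lambda e: e["plausible"],
}


def filter_edits(
    edits: List[dict], checks: Dict[str, Check]
) -> Tuple[List[dict], Dict[str, int]]:
    """Keep edits that pass every check and count failures per check."""
    retained = []
    failures = {name: 0 for name in checks}
    for edit in edits:
        failed = [name for name, check in checks.items() if not check(edit)]
        for name in failed:
            failures[name] += 1
        if not failed:
            retained.append(edit)
    return retained, failures


# Made-up example scores, not pilot data.
edits = [
    {"id": "a", "place_sim": 0.9, "outside_change": 0.05, "realism": 0.7, "plausible": True},
    {"id": "b", "place_sim": 0.6, "outside_change": 0.05, "realism": 0.7, "plausible": True},
    {"id": "c", "place_sim": 0.9, "outside_change": 0.30, "realism": 0.4, "plausible": True},
]
retained, failures = filter_edits(edits, CHECKS)
print([e["id"] for e in retained], failures)
```

Counting failures per check, rather than only dropping edits, is one way to build the kind of failure taxonomy the paper describes.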

What would settle it

A study in which human raters perform pairwise safety comparisons on the original and edited image pairs and find no consistent directional shifts matching the proxy model results for the levers ranked highest in the pilot.
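
One way such a study could be scored is an exact sign test on whether the human majority preference matches the sign of the proxy shift for each original/edited pair. The sketch below is an assumption about the analysis, not a protocol from the paper; the function names and the input encoding (`True` when the rater majority judges the edited image safer) are hypothetical.

```python
from math import comb


def sign_test_p(successes: int, n: int) -> float:
    """Two-sided exact binomial test of `successes` out of `n` against p = 0.5."""
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def agreement(proxy_shifts, human_prefers_edit):
    """Count pairs where the human majority direction matches the proxy sign.

    Pairs with a zero proxy shift carry no direction and are skipped.
    """
    pairs = [(s, h) for s, h in zip(proxy_shifts, human_prefers_edit) if s != 0]
    agree = sum((s > 0) == h for s, h in pairs)
    return agree, len(pairs)


# Made-up inputs for illustration only.
agree, n = agreement([0.2, -0.1, 0.3, 0.0], [True, False, False, True])
print(agree, n, sign_test_p(agree, n))
```

An agreement rate near one half for the top-ranked levers, with a large p-value, would be the outcome described above as settling the question against the proxy results.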

Original abstract

Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to develop a lever-based interventional counterfactual framework to identify localized visual changes that shift human perceptions of urban scenes, such as safety. Levers are defined by semantic concepts, spatial support, intervention direction, and edit templates. Edits are generated using prompt-conditioned image editing and filtered by validity checks for same-place preservation, locality, realism, and plausibility. A pilot on 50 scenes from five cities shows proxy-based patterns where Mobility Infrastructure and Physical Maintenance levers produce the largest safety shifts, along with a failure taxonomy for prompt-only edits. Human pairwise judgments are reserved for future validation.

Significance. If the validity checks ensure realistic edits and the proxy shifts align with human judgments, this framework could provide a scalable, interventional approach to explain and potentially guide urban perception models beyond current correlational methods. It offers a structured way to search for actionable visual levers, which could have applications in urban planning and design. The practical failure taxonomy for prompt-based editing is a valuable contribution for future work in this area. However, because the pilot results are proxy-only and lack detailed quantitative validation, their current significance is limited.

major comments (2)
  1. Abstract: The statement that Mobility Infrastructure and Physical Maintenance show the largest auxiliary safety shifts is not supported by any reported numbers, such as the magnitude of shifts, the number of scenes or edits per category, or comparisons to other levers. This is load-bearing for the directional patterns claimed and requires quantitative backing from the pilot data.
  2. Methods (validity checks section): No information is given on the proportion of generated edits that pass the four validity checks or on how the checks are implemented (e.g., automated metrics or human review). Since the patterns depend on the retained edits, this omission risks the results being driven by systematic biases in the editing process rather than genuine perceptual changes.
minor comments (1)
  1. The abstract could briefly specify the proxy perception model used for the pilot shifts to improve reproducibility.
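
The quantitative backing requested in major comment 1 could take the form of a per-category table of edit counts and proxy shifts. The sketch below shows one possible tabulation, assuming each retained edit is stored with its lever category and the proxy score before and after editing. The field names, category labels, and numbers are placeholders and do not come from the pilot.

```python
from collections import defaultdict
from statistics import mean


def summarise(records):
    """Return (category, n_edits, mean_shift, mean_abs_shift) rows.

    Rows are sorted by mean absolute shift, largest first.
    """
    by_category = defaultdict(list)
    for r in records:
        by_category[r["category"]].append(r["after"] - r["before"])
    rows = [
        (cat, len(shifts), mean(shifts), mean([abs(x) for x in shifts]))
        for cat, shifts in by_category.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Placeholder records; categories and scores are invented for the example.
records = [
    {"category": "category_A", "before": 5.0, "after": 5.5},
    {"category": "category_A", "before": 4.0, "after": 4.5},
    {"category": "category_B", "before": 5.0, "after": 5.25},
]
rows = summarise(records)
for cat, n, shift, abs_shift in rows:
    print(f"{cat}: n={n}, mean shift={shift:+.2f}, mean |shift|={abs_shift:.2f}")
```

Reporting the count alongside the mean matters here because a category with few retained edits can rank first on a handful of large shifts.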

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that the manuscript requires additional quantitative detail and methodological transparency to strengthen the presentation of the pilot results.

Point-by-point responses
  1. Referee: Abstract: The statement that Mobility Infrastructure and Physical Maintenance show the largest auxiliary safety shifts is not supported by any reported numbers, such as the magnitude of shifts, the number of scenes or edits per category, or comparisons to other levers. This is load-bearing for the directional patterns claimed and requires quantitative backing from the pilot data.

    Authors: We agree that the abstract statement would be stronger with explicit quantitative support. The pilot results section reports directional patterns derived from proxy safety scores across the 50 scenes and retained edits, but the abstract itself does not include magnitudes, per-category counts, or direct comparisons. In the revision we will update the abstract to incorporate summary statistics from the pilot (e.g., average proxy safety shift per lever category, number of valid edits retained per category, and relative ordering), ensuring the claim is directly backed by the reported data.
    Revision: yes

  2. Referee: Methods (validity checks section): No information is given on the proportion of generated edits that pass the four validity checks or on how the checks are implemented (e.g., automated metrics or human review). Since the patterns depend on the retained edits, this omission risks the results being driven by systematic biases in the editing process rather than genuine perceptual changes.

    Authors: This is a fair criticism. The Methods section defines the four validity checks but omits pass-rate statistics and implementation specifics. We will add a dedicated paragraph and summary table reporting the proportion of edits that passed each check (and all checks combined) in the pilot, together with the concrete implementation: automated metrics (CLIP similarity for same-place preservation, bounding-box overlap for locality, and perceptual-loss thresholds for realism) supplemented by human review on a 20% random sample for plausibility. This will allow readers to evaluate potential selection biases in the retained edit set.
    Revision: yes
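
To make the second response concrete, the sketch below shows how a bounding-box overlap score and a pass-rate table might be computed. It assumes that locality compares the lever's intended support box with the box enclosing the changed pixels, and that each edit's check outcomes are stored as booleans. Both are illustrative assumptions; the simulated rebuttal names the metrics but not their exact definitions or thresholds.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) axis-aligned boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def pass_rates(results):
    """Share of edits passing each check, plus the share passing all of them."""
    n = len(results)
    rates = {name: sum(r[name] for r in results) / n for name in results[0]}
    rates["all"] = sum(all(r.values()) for r in results) / n
    return rates


# Made-up outcomes for four edits; not pilot data.
results = [
    {"same_place": True, "locality": True, "realism": True, "plausibility": True},
    {"same_place": True, "locality": False, "realism": True, "plausibility": True},
    {"same_place": False, "locality": False, "realism": True, "plausibility": True},
    {"same_place": True, "locality": True, "realism": False, "plausibility": True},
]
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(pass_rates(results))
```

The combined rate can sit well below every individual rate, which is why the referee's concern about selection bias applies to the retained set as a whole rather than to any single check.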

Circularity Check

0 steps flagged

No circularity: framework applies external image models to generate edits and reports proxy patterns without reducing claims to fitted inputs or self-citations

Full rationale

The paper introduces a lever-based interventional framework that generates candidate edits via prompt-conditioned image editing, applies external validity checks (same-place preservation, locality, realism, plausibility), and reports preliminary proxy-based directional patterns from a 50-scene pilot. No equations, fitted parameters, or self-definitional reductions appear in the derivation; the central results are explicitly proxy-based and defer human pairwise judgments to future validation. The approach relies on independent components (pre-trained image editors, proxy perception models) whose outputs are not constructed from the target patterns themselves. This satisfies the criteria for a self-contained, non-circular analysis against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on untested assumptions about the fidelity of prompt-based image editing and the correspondence between proxy scores and human perception; no free parameters or invented physical entities are specified.

axioms (1)
  • Domain assumption: Prompt-conditioned image editing can generate edits that preserve same-place identity, locality, realism, and plausibility.
    Invoked as the basis for generating and retaining candidate edits in the framework.
invented entities (1)
  • Visual lever (no independent evidence)
    Purpose: Specifies a semantic concept, spatial support, intervention direction, and constrained edit template to bound the search over counterfactual edits.
    Core structuring device introduced by the paper to organize the interventional search.
