Collective Recourse for Generative Urban Visualizations
Pith reviewed 2026-05-18 17:07 UTC · model grok-4.3
The pith
Collective recourse lets communities submit structured visual bug reports that trigger targeted fixes in text-to-image models and urban planning workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize collective recourse and a practical pipeline that turns community visual bug reports into fixes for text-to-image diffusion models and planning workflows. Four recourse primitives are placed at different layers of the diffusion stack: counter-prompts, negative prompts, dataset edits, and reward-model tweaks. A mandate score that combines severity, volume saturation, representativeness, and evidence determines when a report triggers action. In a synthetic program of 240 reports, prompt-level fixes showed median times of 2.1–3.4 days but 21–38% recurrence, while dataset edits and reward tweaks took 13.5 and 21.9 days yet reduced recurrence to 12–18% with 30–36% planner uptake. A 0
What carries the argument
The collective recourse pipeline with four diffusion-stack primitives (counter-prompts, negative prompts, dataset edits, reward-model tweaks) that convert community reports into model or workflow changes once a mandate threshold is met.
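A minimal sketch of how the pipeline stages and the four primitives might be represented in code. The stage names follow the paper's abstract (report, triage, fix, verify, closure); the `BugReport` class, its fields, and the strictly linear stage order are hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    # Stage names as listed in the paper's abstract.
    REPORT = "report"
    TRIAGE = "triage"
    FIX = "fix"
    VERIFY = "verify"
    CLOSURE = "closure"


class Primitive(Enum):
    # The four recourse primitives named in the paper.
    COUNTER_PROMPT = "counter-prompt"
    NEGATIVE_PROMPT = "negative prompt"
    DATASET_EDIT = "dataset edit"
    REWARD_TWEAK = "reward-model tweak"


# Assumption: stages run in a fixed linear order with no loops back.
_ORDER = list(Stage)


@dataclass
class BugReport:
    description: str
    mandate_score: float  # assumed to be computed elsewhere
    primitive: Optional[Primitive] = None  # fix chosen at triage, if any
    stage: Stage = Stage.REPORT

    def advance(self, threshold: float = 0.12) -> Stage:
        """Move to the next stage; a report below the threshold stays in triage."""
        if self.stage is Stage.TRIAGE and self.mandate_score < threshold:
            return self.stage  # mandate not met, so no fix is triggered
        if self.stage is not Stage.CLOSURE:
            self.stage = _ORDER[_ORDER.index(self.stage) + 1]
        return self.stage
```

In this sketch a report with a score below the threshold never leaves triage, while one above it walks through fix and verify to closure. A real deployment would likely need to let a failed verification return a report to the fix stage.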
If this is right
- Prompt-level fixes can be implemented in a median of 2.1 to 3.4 days.
- Dataset edits and reward-model tweaks require 13.5 and 21.9 days respectively but achieve recurrence rates of only 12–18 percent.
- The deeper fixes also produce planner uptake rates of 30–36 percent.
- A mandate threshold of 0.12 reaches 93 percent precision and 75 percent recall.
- Raising representativeness in the mandate score lifts recall to 81 percent while preserving most of the precision.
Where Pith is reading between the lines
- The same report-and-fix loop could be adapted for other generative domains such as architectural renderings or environmental impact simulations.
- Accumulated fixes over time might gradually improve the equity properties of the underlying training data for urban models.
- Planners could embed mandate scores into existing review dashboards to prioritize which AI-generated proposals receive human scrutiny.
- Transparent public dashboards showing which reports led to changes could help maintain community trust in the process.
Load-bearing premise
The assumption that a synthetic program of 240 reports can meaningfully evaluate the real-world durability, planner uptake, and mandate threshold performance of the proposed recourse primitives and pipeline.
What would settle it
A live deployment in which actual community groups submit visual bug reports on urban visualizations and the measured recurrence rates plus planner uptake percentages are compared directly to the synthetic results.
Original abstract
Text-to-image diffusion models help visualize urban futures but can amplify group-level harms. We propose collective recourse: structured community "visual bug reports" that trigger fixes to models and planning workflows. We (1) formalize collective recourse and a practical pipeline (report, triage, fix, verify, closure); (2) situate four recourse primitives within the diffusion stack: counter-prompts, negative prompts, dataset edits, and reward-model tweaks; (3) define mandate thresholds via a mandate score combining severity, volume saturation, representativeness, and evidence; and (4) evaluate a synthetic program of 240 reports. Prompt-level fixes were fastest (median 2.1-3.4 days) but less durable (21-38% recurrence); dataset edits and reward tweaks were slower (13.5 and 21.9 days) yet more durable (12-18% recurrence) with higher planner uptake (30-36%). A threshold of 0.12 yielded 93% precision and 75% recall; increasing representativeness raised recall to 81% with little precision loss. We discuss integration with participatory governance, risks (e.g., overfitting to vocal groups), and safeguards (dashboards, rotating juries).
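The mandate score can be illustrated with a short sketch. The formula is the one this page quotes from the paper, M(b) = s·(1 − exp(−n/τ_n))·r·q. Reading s as severity, n as report volume, τ_n as a saturation constant, r as representativeness, and q as evidence quality is an assumption inferred from the list of components, and the default `tau_n` and all input values below are hypothetical.

```python
import math


def mandate_score(severity: float, n_reports: int, representativeness: float,
                  evidence: float, tau_n: float = 10.0) -> float:
    """M(b) = s * (1 - exp(-n / tau_n)) * r * q.

    Assumes severity, representativeness, and evidence lie in [0, 1];
    tau_n = 10.0 is an illustrative value, not one taken from the paper.
    """
    # Volume saturation: rises quickly for the first reports, then levels off.
    saturation = 1.0 - math.exp(-n_reports / tau_n)
    return severity * saturation * representativeness * evidence


THRESHOLD = 0.12  # the operating point reported in the paper

# Hypothetical report: fairly severe, 12 submissions, moderate coverage and evidence.
score = mandate_score(severity=0.8, n_reports=12,
                      representativeness=0.6, evidence=0.7)
triggers_action = score >= THRESHOLD
```

Because the components are multiplied, a score of zero on any one of them (for example, no supporting evidence) blocks action regardless of how many reports arrive. That is a property of the product form itself rather than a claim about how the authors tuned it.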
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that text-to-image diffusion models amplify group-level harms in urban visualizations and proposes 'collective recourse' via structured community visual bug reports. It formalizes a pipeline (report, triage, fix, verify, closure), situates four recourse primitives (counter-prompts, negative prompts, dataset edits, reward-model tweaks) in the diffusion stack, defines a mandate score (severity, volume saturation, representativeness, evidence) for action thresholds, and evaluates via a synthetic program of 240 reports. Results show prompt-level fixes as fastest (median 2.1-3.4 days) but less durable (21-38% recurrence), dataset edits/reward tweaks as slower (13.5/21.9 days) yet more durable (12-18% recurrence) with 30-36% planner uptake, and a 0.12 mandate threshold achieving 93% precision/75% recall (improving to 81% recall with higher representativeness).
Significance. If the synthetic evaluation generalizes, the work could provide a structured HCI-oriented mechanism to address biases in generative tools for urban planning, with explicit integration to participatory governance and safeguards against risks like vocal-group overfitting. Strengths include the clear pipeline formalization, positioning of primitives within the model stack, and quantitative trade-off analysis between speed and durability.
major comments (2)
- §4 (Evaluation): All reported quantitative outcomes (prompt fix medians of 2.1-3.4 days with 21-38% recurrence, dataset-edit medians of 13.5 days with 12-18% recurrence and 30-36% uptake, and the 0.12 mandate threshold at 93% precision/75% recall) are generated exclusively from the synthetic 240-report program. No details are given on report-generation mechanics, triage/planner simulation parameters, or any real-world calibration, so the durability and uptake claims rest on untested modeling assumptions.
- Mandate threshold section (near §3): The 0.12 threshold and its precision/recall figures, plus the representativeness adjustment raising recall to 81%, are presented without sensitivity analysis to the mandate-score weights or to variations in the synthetic recurrence/triage dynamics; this makes the threshold's reported performance load-bearing for the pipeline's practical utility but dependent on the specific simulation.
minor comments (2)
- Abstract: The prompt-level fix range (2.1-3.4 days) and recurrence range (21-38%) are presented without stating which primitives or conditions produce the bounds.
- Discussion: The safeguards (dashboards, rotating juries) are mentioned but not illustrated with even a brief example of how a rotating jury would alter triage or mandate scoring in the pipeline.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We respond point-by-point to the major comments and describe the revisions we will incorporate.
- Referee: §4 (Evaluation): All reported quantitative outcomes (prompt fix medians of 2.1-3.4 days with 21-38% recurrence, dataset-edit medians of 13.5 days with 12-18% recurrence and 30-36% uptake, and the 0.12 mandate threshold at 93% precision/75% recall) are generated exclusively from the synthetic 240-report program. No details are given on report-generation mechanics, triage/planner simulation parameters, or any real-world calibration, so the durability and uptake claims rest on untested modeling assumptions.
  Authors: We agree that the quantitative results derive from the synthetic program of 240 reports, which is explicitly described as such in the manuscript to enable controlled analysis of the pipeline. In the revision we will expand §4 with a new subsection detailing the report-generation mechanics (including harm scenario distributions and sampling procedures), the exact parameters and decision rules used to simulate triage and planner uptake, and the recurrence model. We will also add an explicit limitations paragraph discussing the modeling assumptions and their implications for generalizability. Real-world calibration is not feasible in the present work, as it would require longitudinal field data from deployed community reporting systems and planner interventions.
  Revision: yes
- Referee: Mandate threshold section (near §3): The 0.12 threshold and its precision/recall figures, plus the representativeness adjustment raising recall to 81%, are presented without sensitivity analysis to the mandate-score weights or to variations in the synthetic recurrence/triage dynamics; this makes the threshold's reported performance load-bearing for the pipeline's practical utility but dependent on the specific simulation.
  Authors: We accept that additional sensitivity analysis is warranted. The revised manuscript will include a new analysis that varies the relative weights of the four mandate-score components and perturbs recurrence and triage accuracy parameters across plausible ranges. Results will be reported in a supplementary table showing precision/recall trade-offs, confirming that 0.12 remains a stable operating point with the representativeness adjustment continuing to improve recall without substantial precision loss. This will be accompanied by guidance on how practitioners might retune the threshold for different deployment contexts.
  Revision: yes
- Empirical real-world calibration of the synthetic simulation parameters, which would require deployment and collection of actual community reports and planner actions outside the scope of this formalization and synthetic evaluation paper.
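The threshold sensitivity analysis discussed in the second response could start from something like the sketch below, which computes precision and recall at several candidate thresholds. The scores and ground-truth labels are toy values invented for illustration and are not drawn from the paper's 240-report program; the function name and the convention of returning 1.0 when a denominator is empty are also assumptions.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], should_act: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall of the rule 'act when score >= threshold'."""
    tp = fp = fn = 0
    for score, label in zip(scores, should_act):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged and not label:
            fp += 1
        elif not flagged and label:
            fn += 1
    # Convention (an assumption): an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical mandate scores and whether each report truly warranted action.
toy_scores = [0.05, 0.10, 0.13, 0.20, 0.35, 0.08]
toy_labels = [False, True, False, True, True, False]

# Sweep a few thresholds around the paper's operating point of 0.12.
sweep = {t: precision_recall(toy_scores, toy_labels, t)
         for t in (0.08, 0.12, 0.16)}
```

On this toy data, lowering the threshold trades precision for recall, which is the trade-off a sensitivity table would need to chart. Varying the mandate-score weights would mean recomputing the scores before each sweep.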
Circularity Check
No significant circularity; derivation remains self-contained.
The paper first formalizes collective recourse and the report-triage-fix-verify-closure pipeline, situates the four primitives (counter-prompts, negative prompts, dataset edits, reward-model tweaks) in the diffusion stack, and defines the mandate score from severity, volume saturation, representativeness, and evidence. It then runs an independent synthetic program of 240 reports to produce the reported metrics (median fix times, recurrence percentages, planner uptake, and precision/recall at threshold 0.12). These outputs are generated by executing the defined simulation rather than being equivalent to the inputs by construction, and no load-bearing self-citations or ansatzes that collapse the central claims are present. The evaluation tests the framework instead of redefining it.
Axiom & Free-Parameter Ledger
free parameters (1)
- mandate threshold 0.12
axioms (1)
- domain assumption: Community reports can be meaningfully scored for severity, volume saturation, representativeness, and evidence
invented entities (2)
- collective recourse: no independent evidence
- mandate score: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose collective recourse: structured community 'visual bug reports' that trigger fixes to models and planning workflows... mandate score M(b) = s·(1 − exp(−n/τ_n))·r·q"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Prompt-level fixes were fastest (median 2.1-3.4 days) but less durable (21-38% recurrence); dataset edits... 13.5 days... 30-36% planner uptake"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.