arxiv: 2509.11487 · v2 · submitted 2025-09-15 · 💻 cs.HC · cs.CY

Collective Recourse for Generative Urban Visualizations

Pith reviewed 2026-05-18 17:07 UTC · model grok-4.3

classification 💻 cs.HC cs.CY
keywords collective recourse · visual bug reports · text-to-image diffusion · urban visualizations · participatory governance · mandate thresholds · AI planning tools · generative models

The pith

Collective recourse lets communities submit structured visual bug reports that trigger targeted fixes in text-to-image models and urban planning workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes collective recourse as a structured process where communities flag harms in generative urban visualizations through visual bug reports. It defines a pipeline of report, triage, fix, verify, and closure, along with four intervention points inside the diffusion model stack. Evaluation on 240 synthetic reports shows that prompt-based fixes deploy quickly yet recur more often, while dataset edits and reward adjustments take longer but produce more durable results and higher planner adoption. Mandate thresholds based on severity, volume, representativeness, and evidence can be tuned for good precision-recall balance. This framework matters because it gives communities a concrete mechanism to influence AI tools that shape public planning decisions.
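
A minimal sketch of the five-stage pipeline as a small state machine may make the flow concrete. The stage names come from the paper; the transition table (including the verify-to-fix loop and the option to close a report at triage), the `VisualBugReport` class, and the sample report text are assumptions made here for illustration, not the author's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    REPORT = "report"
    TRIAGE = "triage"
    FIX = "fix"
    VERIFY = "verify"
    CLOSURE = "closure"


# Allowed transitions. Two of them are assumptions of this sketch:
# triage -> closure (report declined) and verify -> fix (fix needs rework).
TRANSITIONS = {
    Stage.REPORT: {Stage.TRIAGE},
    Stage.TRIAGE: {Stage.FIX, Stage.CLOSURE},
    Stage.FIX: {Stage.VERIFY},
    Stage.VERIFY: {Stage.FIX, Stage.CLOSURE},
    Stage.CLOSURE: set(),
}


@dataclass
class VisualBugReport:
    """Hypothetical record for one community report moving through the pipeline."""
    report_id: str
    description: str
    stage: Stage = Stage.REPORT
    history: list = field(default_factory=list)

    def advance(self, target: Stage) -> None:
        """Move to `target`, rejecting transitions the table does not allow."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(f"cannot move from {self.stage.value} to {target.value}")
        self.history.append(self.stage)
        self.stage = target


# Hypothetical report walking the straight-through path.
report = VisualBugReport("r-001", "rendering omits accessible sidewalks")
for step in (Stage.TRIAGE, Stage.FIX, Stage.VERIFY, Stage.CLOSURE):
    report.advance(step)
print(report.stage.value, [s.value for s in report.history])
```

Keeping the transitions in one table means a governance change, such as requiring a second verification pass, would be a one-line edit rather than a change scattered across the code.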

Core claim

We formalize collective recourse and a practical pipeline that turns community visual bug reports into fixes for text-to-image diffusion models and planning workflows. Four recourse primitives are placed at different layers of the diffusion stack: counter-prompts, negative prompts, dataset edits, and reward-model tweaks. A mandate score that combines severity, volume saturation, representativeness, and evidence determines when a report triggers action. In a synthetic program of 240 reports, prompt-level fixes showed median times of 2.1–3.4 days but 21–38% recurrence, while dataset edits and reward tweaks took 13.5 and 21.9 days yet reduced recurrence to 12–18% with 30–36% planner uptake. A 0
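
The summary names the four components of the mandate score but not how they are combined, so the following is only a sketch under stated assumptions. It assumes each component lies in [0, 1], that volume saturation is a diminishing-returns function of the report count, and that the components are multiplied together. The function names, the `scale` parameter, and the example inputs are all hypothetical; only the 0.12 threshold is taken from the paper.

```python
import math

MANDATE_THRESHOLD = 0.12  # operating point reported in the paper


def volume_saturation(n_reports: int, scale: float = 10.0) -> float:
    """Map a raw report count to [0, 1) with diminishing returns.

    The exponential form and the `scale` value are assumptions of this sketch.
    """
    return 1.0 - math.exp(-n_reports / scale)


def mandate_score(severity: float, n_reports: int,
                  representativeness: float, evidence: float) -> float:
    """Combine the four components by multiplication (assumed form)."""
    components = (severity, volume_saturation(n_reports),
                  representativeness, evidence)
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("all components must lie in [0, 1]")
    return math.prod(components)


def triggers_action(score: float, threshold: float = MANDATE_THRESHOLD) -> bool:
    """A report cluster triggers action once its score reaches the threshold."""
    return score >= threshold


# Hypothetical inputs: same harm, reported by 25 people versus by one person.
many = mandate_score(severity=0.8, n_reports=25, representativeness=0.6, evidence=0.7)
single = mandate_score(severity=0.8, n_reports=1, representativeness=0.6, evidence=0.7)
print(triggers_action(many), triggers_action(single))
```

A multiplicative form lets any single weak component hold the score down, which is one way to stop volume alone from forcing a fix; an additive form would behave differently, and the paper may use either or neither.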

What carries the argument

The collective recourse pipeline with four diffusion-stack primitives (counter-prompts, negative prompts, dataset edits, reward-model tweaks) that convert community reports into model or workflow changes once a mandate threshold is met.

If this is right

  • Prompt-level fixes can be implemented in a median of 2.1 to 3.4 days.
  • Dataset edits and reward-model tweaks require 13.5 and 21.9 days respectively but achieve recurrence rates of only 12–18 percent.
  • The deeper fixes also produce planner uptake rates of 30–36 percent.
  • A mandate threshold of 0.12 reaches 93 percent precision and 75 percent recall.
  • Raising representativeness in the mandate score lifts recall to 81 percent while preserving most of the precision.
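
To read the precision and recall figures as counts, the short calculation below backs out a confusion table from the reported rates. The number of reports that truly warrant action is not given in this summary, so the value of 100 is a hypothetical chosen for easy reading, and the function name is likewise an assumption.

```python
def counts_from_rates(precision: float, recall: float, actionable: int) -> dict:
    """Back out confusion counts from precision, recall, and the number of
    reports that truly warrant action (a hypothetical input)."""
    true_pos = recall * actionable
    false_neg = actionable - true_pos
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_pos = true_pos * (1.0 - precision) / precision
    return {"true_pos": true_pos, "false_neg": false_neg, "false_pos": false_pos}


# Reported operating point: threshold 0.12, 93% precision, 75% recall.
base = counts_from_rates(precision=0.93, recall=0.75, actionable=100)
print(base)

# With recall raised to 81%, reports still missed per 100 actionable ones.
missed_after = 100 - 0.81 * 100
print(round(missed_after, 2))
```

Under that assumption, the 0.12 threshold acts on 75 of every 100 actionable reports, misses 25, and triggers between five and six unwarranted fixes; the representativeness adjustment reduces the missed reports to 19.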

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same report-and-fix loop could be adapted for other generative domains such as architectural renderings or environmental impact simulations.
  • Accumulated fixes over time might gradually improve the equity properties of the underlying training data for urban models.
  • Planners could embed mandate scores into existing review dashboards to prioritize which AI-generated proposals receive human scrutiny.
  • Transparent public dashboards showing which reports led to changes could help maintain community trust in the process.

Load-bearing premise

The assumption that a synthetic program of 240 reports can meaningfully evaluate the real-world durability, planner uptake, and mandate threshold performance of the proposed recourse primitives and pipeline.

What would settle it

A live deployment in which actual community groups submit visual bug reports on urban visualizations and the measured recurrence rates plus planner uptake percentages are compared directly to the synthetic results.

Figures

Figures reproduced from arXiv: 2509.11487 by Rashid Mushkani.

Figure 1: Collective recourse pipeline tying community reports to model fixes and planning workflows.

Original abstract

Text-to-image diffusion models help visualize urban futures but can amplify group-level harms. We propose collective recourse: structured community "visual bug reports" that trigger fixes to models and planning workflows. We (1) formalize collective recourse and a practical pipeline (report, triage, fix, verify, closure); (2) situate four recourse primitives within the diffusion stack: counter-prompts, negative prompts, dataset edits, and reward-model tweaks; (3) define mandate thresholds via a mandate score combining severity, volume saturation, representativeness, and evidence; and (4) evaluate a synthetic program of 240 reports. Prompt-level fixes were fastest (median 2.1-3.4 days) but less durable (21-38% recurrence); dataset edits and reward tweaks were slower (13.5 and 21.9 days) yet more durable (12-18% recurrence) with higher planner uptake (30-36%). A threshold of 0.12 yielded 93% precision and 75% recall; increasing representativeness raised recall to 81% with little precision loss. We discuss integration with participatory governance, risks (e.g., overfitting to vocal groups), and safeguards (dashboards, rotating juries).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that text-to-image diffusion models amplify group-level harms in urban visualizations and proposes 'collective recourse' via structured community visual bug reports. It formalizes a pipeline (report, triage, fix, verify, closure), situates four recourse primitives (counter-prompts, negative prompts, dataset edits, reward-model tweaks) in the diffusion stack, defines a mandate score (severity, volume saturation, representativeness, evidence) for action thresholds, and evaluates via a synthetic program of 240 reports. Results show prompt-level fixes as fastest (median 2.1-3.4 days) but less durable (21-38% recurrence), dataset edits/reward tweaks as slower (13.5/21.9 days) yet more durable (12-18% recurrence) with 30-36% planner uptake, and a 0.12 mandate threshold achieving 93% precision/75% recall (improving to 81% recall with higher representativeness).

Significance. If the synthetic evaluation generalizes, the work could provide a structured HCI-oriented mechanism to address biases in generative tools for urban planning, with explicit integration to participatory governance and safeguards against risks like vocal-group overfitting. Strengths include the clear pipeline formalization, positioning of primitives within the model stack, and quantitative trade-off analysis between speed and durability.

major comments (2)
  1. [§4] §4 (Evaluation): All reported quantitative outcomes—prompt fix medians of 2.1-3.4 days with 21-38% recurrence, dataset-edit medians of 13.5 days with 12-18% recurrence and 30-36% uptake, and the 0.12 mandate threshold at 93% precision/75% recall—are generated exclusively from the synthetic 240-report program. No details are given on report-generation mechanics, triage/planner simulation parameters, or any real-world calibration, so the durability and uptake claims rest on untested modeling assumptions.
  2. [Mandate threshold section] Mandate threshold section (near §3): The 0.12 threshold and its precision/recall figures, plus the representativeness adjustment raising recall to 81%, are presented without sensitivity analysis to the mandate-score weights or to variations in the synthetic recurrence/triage dynamics; this makes the threshold's reported performance load-bearing for the pipeline's practical utility but dependent on the specific simulation.
minor comments (2)
  1. [Abstract] Abstract: The prompt-level fix range (2.1-3.4 days) and recurrence range (21-38%) are presented without stating which primitives or conditions produce the bounds.
  2. [Discussion] Discussion: The safeguards (dashboards, rotating juries) are mentioned but not illustrated with even a brief example of how a rotating jury would alter triage or mandate scoring in the pipeline.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback. We respond point-by-point to the major comments and describe the revisions we will incorporate.

  1. Referee: [§4] §4 (Evaluation): All reported quantitative outcomes—prompt fix medians of 2.1-3.4 days with 21-38% recurrence, dataset-edit medians of 13.5 days with 12-18% recurrence and 30-36% uptake, and the 0.12 mandate threshold at 93% precision/75% recall—are generated exclusively from the synthetic 240-report program. No details are given on report-generation mechanics, triage/planner simulation parameters, or any real-world calibration, so the durability and uptake claims rest on untested modeling assumptions.

    Authors: We agree that the quantitative results derive from the synthetic program of 240 reports, which is explicitly described as such in the manuscript to enable controlled analysis of the pipeline. In the revision we will expand §4 with a new subsection detailing the report-generation mechanics (including harm scenario distributions and sampling procedures), the exact parameters and decision rules used to simulate triage and planner uptake, and the recurrence model. We will also add an explicit limitations paragraph discussing the modeling assumptions and their implications for generalizability. Real-world calibration is not feasible in the present work, as it would require longitudinal field data from deployed community reporting systems and planner interventions. revision: yes

  2. Referee: [Mandate threshold section] Mandate threshold section (near §3): The 0.12 threshold and its precision/recall figures, plus the representativeness adjustment raising recall to 81%, are presented without sensitivity analysis to the mandate-score weights or to variations in the synthetic recurrence/triage dynamics; this makes the threshold's reported performance load-bearing for the pipeline's practical utility but dependent on the specific simulation.

    Authors: We accept that additional sensitivity analysis is warranted. The revised manuscript will include a new analysis that varies the relative weights of the four mandate-score components and perturbs recurrence and triage accuracy parameters across plausible ranges. Results will be reported in a supplementary table showing precision/recall trade-offs, confirming that 0.12 remains a stable operating point with the representativeness adjustment continuing to improve recall without substantial precision loss. This will be accompanied by guidance on how practitioners might retune the threshold for different deployment contexts. revision: yes
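
The sensitivity analysis promised in this response could take roughly the following shape. This is a sketch only: the weighted-product score, the random toy reports, the rule used to label a report as warranted, and the grid of weights and thresholds are all assumptions made here. The count of 240 simply mirrors the size of the paper's synthetic program, and the printed numbers say nothing about the paper's actual results.

```python
import random

KEYS = ("severity", "volume", "representativeness", "evidence")


def mandate_score(components: dict, weights: dict) -> float:
    """Weighted product of components in [0, 1] (assumed form)."""
    score = 1.0
    for key in KEYS:
        score *= components[key] ** weights[key]
    return score


def precision_recall(reports: list, weights: dict, threshold: float) -> tuple:
    """Precision and recall of the 'score >= threshold' trigger rule."""
    tp = fp = fn = 0
    for components, warranted in reports:
        triggered = mandate_score(components, weights) >= threshold
        if triggered and warranted:
            tp += 1
        elif triggered:
            fp += 1
        elif warranted:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data: random components, with an arbitrary stand-in rule for ground truth.
rng = random.Random(0)
reports = []
for _ in range(240):
    c = {k: rng.random() for k in KEYS}
    reports.append((c, c["severity"] * c["evidence"] > 0.2))

# Sweep the representativeness weight and the threshold around 0.12.
results = {}
for rep_weight in (0.5, 1.0, 2.0):
    weights = {k: 1.0 for k in KEYS}
    weights["representativeness"] = rep_weight
    for threshold in (0.06, 0.12, 0.24):
        results[(rep_weight, threshold)] = precision_recall(reports, weights, threshold)

for (rep_weight, threshold), (p, r) in sorted(results.items()):
    print(f"w_rep={rep_weight:.1f} thr={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

A grid like this is what would show whether 0.12 is a stable operating point or an artifact of one weight setting, which is the referee's concern.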

standing simulated objections not resolved
  • Empirical real-world calibration of the synthetic simulation parameters. This would require deploying the system and collecting actual community reports and planner actions, which is outside the scope of a formalization and synthetic-evaluation paper.

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained.

The paper first formalizes collective recourse and the report-triage-fix-verify-closure pipeline, situates the four primitives (counter-prompts, negative prompts, dataset edits, reward-model tweaks) in the diffusion stack, and defines the mandate score from severity, volume saturation, representativeness, and evidence. It then runs an independent synthetic program of 240 reports to produce the reported metrics (median fix times, recurrence percentages, planner uptake, and precision/recall at threshold 0.12). These outputs are generated by executing the defined simulation rather than being equivalent to the inputs by construction, and no load-bearing self-citations or ansatzes that collapse the central claims are present. The evaluation tests the framework instead of redefining it.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The proposal introduces new conceptual machinery (collective recourse, mandate score, four primitives) whose effectiveness is tested only synthetically; no external benchmarks or prior empirical results are invoked in the abstract to ground the durability or uptake claims.

free parameters (1)
  • mandate threshold 0.12
    Chosen to achieve 93% precision and 75% recall on the synthetic reports; represents a fitted operating point rather than a derived constant.
axioms (1)
  • domain assumption: Community reports can be meaningfully scored for severity, volume saturation, representativeness, and evidence
    Invoked when defining the mandate score that decides whether a report triggers action.
invented entities (2)
  • collective recourse (no independent evidence)
    purpose: Structured mechanism linking community visual bug reports to model and workflow fixes
    New framework proposed to address group-level harms in generative urban visualizations
  • mandate score (no independent evidence)
    purpose: Composite metric to decide when reports warrant fixes
    Combines severity, volume saturation, representativeness, and evidence; no external validation cited
