Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI

Anthony Rios; Dan Schumacher; Tongnian Wang; Veronica Rammouz; Xingmeng Zhao

arxiv: 2510.14718 · v3 · submitted 2025-10-16 · 💻 cs.CL

Xingmeng Zhao, Tongnian Wang, Dan Schumacher, Veronica Rammouz, Anthony Rios

Pith reviewed 2026-05-18 06:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords speculative storytelling · healthcare AI · potential harms · user study · multi-agent systems · human-centered design · AI ethics · risk anticipation

The pith

Speculative stories help people recognize a broader range of harms from healthcare AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a human-centered framework that generates speculative user stories and facilitates multi-agent discussions to help people think creatively about the benefits and harms of healthcare AI before deployment. In a user study, participants who read the stories identified harms spread across all 17 harm types, whereas those who did not read stories focused primarily on privacy and well-being (79.1%). The method aims to keep humans engaged in understanding how harms arise and whom they affect, rather than relying on automated detection alone. Sympathetic readers would care because it addresses the risk that rapid AI development ignores real-world contexts and diverse user needs.

Core claim

The framework generates user stories and supports multi-agent discussions, leading participants in the study to recognize a broader range of harms and benefits, distributing responses more evenly across 17 harm types compared to focusing mainly on privacy and well-being.
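One way to make "more evenly distributed across 17 harm types" concrete is a normalized Shannon entropy over the per-category response counts. The sketch below is illustrative only: the function name `normalized_entropy` and the toy counts are assumptions made for this example, not the paper's data or its reported analysis.

```python
import math

NUM_HARM_TYPES = 17  # number of harm types stated in the abstract


def normalized_entropy(counts, num_categories=NUM_HARM_TYPES):
    """Shannon entropy of a count vector, scaled to [0, 1].

    1.0 means responses are spread uniformly over all categories;
    0.0 means every response falls in a single category.
    """
    if len(counts) > num_categories:
        raise ValueError("more count entries than categories")
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one response")
    entropy = 0.0
    for count in counts:
        if count > 0:  # empty categories contribute nothing
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(num_categories)


# Toy counts, invented for illustration only (NOT the study's data).
story_like = [6] * 17                              # evenly spread
control_like = [40, 40, 4, 4, 4, 4, 4] + [0] * 10  # two categories dominate

print(round(normalized_entropy(story_like), 3))
print(round(normalized_entropy(control_like), 3))
```

Dividing by the logarithm of the number of categories keeps the score comparable across conditions with different response totals; a perfectly uniform spread scores 1 by construction, and a concentrated spread scores lower.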

What carries the argument

Speculative story generation combined with multi-agent discussions, which together prompt creative speculation about AI's impact on users.

Load-bearing premise

The speculative stories and multi-agent discussions accurately capture plausible real-world contexts and diverse user needs without their own biases or missing key harms.

What would settle it

Deploy the AI systems discussed in the stories and observe if the predicted harms actually occur at rates matching the broadened recognition from the study.

Figures

Figures reproduced from arXiv: 2510.14718 by Anthony Rios, Dan Schumacher, Tongnian Wang, Veronica Rammouz, Xingmeng Zhao.

**Figure 1.** Illustration of using speculative stories to help people imagine potential harms and benefits of healthcare AI and foster more creative and ethical thinking.

**Figure 2.** Overview of the Storytelling Framework. We first generate use case scenarios from AI concepts sourced from PubMed, Wired, and industry app descriptions. Next, we simulate role-playing and environment trajectories for each scenario, producing detailed simulation logs. Finally, we rephrase these logs into short stories that illustrate both potential benefits and harms of the AI system.

**Figure 3.** Results of human preference evaluation. Our Storytelling method achieves strong preference wins against the baseline, with 88% preference using Llama3 and 76% using Gemma3.

**Figure 4.** A qualitative example showing how our storytelling method makes the AI’s decision process and its consequences easy to follow. Unlike a simple narrative description, the story explicitly surfaces what changed, why it changed, and how stakeholders were affected.

**Figure 5.** A comparison of simulation logs under different ablations (w/o Environment Trajectories and w/o RolePlaying) to show the contribution of each component.

**Figure 6.** Screenshot of our annotation interface used for human evaluation.

**Figure 7.** Interface of the Story-Driven Red-Team Discussion Room, showing the multi-agent conversational setup and user interaction flow.

Demographic attributes of the sample (N=18). Gender: Female 38.9%, Male 61.1%, Other/Non-binary 0.0%, Prefer not to answer 0.0%. Age: 18–29 50.0%, 30–39 33.3%, 40–49 11.1%, 50–59 5.6%, 60+ 0.0%, Prefer not to answer 0.0%. Ethnicity: Hispanic or Latino 27.8%, Asian 44.4%, Black or African descent 11.1%, Ar…

**Figure 8.** Interface of the Story-Driven Red-Team Discussion Room, where expert agents simulate a discussion by responding to the user’s input.

**Figure 9.** Interface used in the model card study, illustrating how participants completed the speculative model card.

**Figure 10.** Pre-study survey assessing demographics, AI familiarity, and baseline attitudes toward story-based

**Figure 11.** Post-study survey assessing clarity, confidence in documenting risks, and the contribution of narrative

**Figure 12.** Prompt used to create use-case scenarios from AI concept descriptions.

**Figure 13.** The prompt for Storytelling Framework to simulate role-playing and environment trajectories.

**Figure 14.** Prompt used for rephrasing AI trajectory logs into ethical harm narratives.

**Figure 15.** Prompt used for the plot-planning story generation baseline.

**Figure 16.** System prompt for LLM-as-a-judge criteria for evaluating stories.

**Figure 17.** Evaluation criteria checklist for LLM-as-a-judge.

read the original abstract

Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 17 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (79.1%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The user study shows that reading AI-generated speculative stories led participants to identify harms more evenly across 17 types instead of focusing mostly on privacy and well-being.

The one or two things to know: The user study found that people who read the generated stories identified harms more evenly across all 17 types, unlike the control group that focused 79.1% on privacy and well-being. This suggests the storytelling approach broadens thinking about AI harms in healthcare.

What is new is the specific setup that combines AI-generated speculative stories with multi-agent discussions and then quantifies the distribution of identified harms. Prior work in design fiction and HCI has used storytelling, but this adds the multi-agent element and the measurement across harm categories for healthcare AI specifically.

The paper does well in presenting a clear contrast in the study results and in emphasizing human engagement over automated risk detection alone. It highlights issues like bias, privacy, and unequal access in tools such as wellness trackers and chatbots. The framework aims to help developers and users anticipate harms by imagining plausible scenarios. That human-centered angle is a strength, as it encourages creative speculation about who is affected and in what contexts.

Soft spots include the limited information on the study execution. The abstract lacks details on participant numbers, statistical analysis, demographics, story generation specifics, and control design. This makes it difficult to assess how solid the difference really is. The stress-test concern about possible priming or bias in the stories is reasonable and should be addressed with more transparency on how the stories were created to ensure they represent plausible contexts without favoring certain harms. If the full paper has those elements and shows no such bias, the claim holds better.

Overall, this paper targets researchers and practitioners in AI for healthcare, HCI, and ethics who are concerned with safety and equity in rapid AI deployment. A reader who values methods for early harm anticipation through creative techniques would find it useful. The presence of an empirical user study and engagement with relevant literature makes it worth a serious referee's time, even with the need for more methodological detail. I recommend sending this to peer review to get expert input on the study validity and potential improvements.

Referee Report

2 major / 2 minor

Summary. The paper proposes a human-centered framework that generates speculative user stories via multi-agent AI discussions to help stakeholders imagine benefits and harms of healthcare AI systems before deployment. It reports a user study in which participants who read the stories distributed their identified harms more evenly across 17 types, while the control group concentrated 79.1% of responses on privacy and well-being concerns.

Significance. If the empirical result holds after addressing methodological gaps, the work could meaningfully advance proactive, human-engaged harm anticipation in healthcare AI, offering a complement to automated detection approaches by fostering creative speculation about diverse user contexts and needs.

major comments (2)

[User study results] (abstract and §4): the central empirical claim of broader harm recognition rests on a comparison of response distributions, yet the manuscript omits sample size, participant demographics, exact statistical tests (e.g., chi-square or entropy measures for evenness), and control-condition wording. These details are load-bearing for interpreting whether the observed difference (even distribution vs. 79.1% focus) is reliable and generalizable.
[Framework and story-generation section] (§3): the multi-agent process used to create the speculative stories may systematically cue the 17 harm categories, turning the user-study effect into an artifact of story content rather than evidence that storytelling per se improves imagination. The paper should report a content analysis of the generated stories (e.g., frequency of explicit harm-type mentions) and/or include a neutral-story control arm to rule out priming.
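The content analysis requested in the second comment could start from something as simple as counting how many stories contain explicit cue terms for each harm category. The sketch below assumes a keyword lexicon; the category names and cue terms in `HARM_LEXICON` are placeholders invented for this example, since the paper's 17 harm types are not enumerated here, and keyword matching is only a rough proxy for explicit cueing.

```python
import re
from collections import Counter

# Hypothetical lexicon: category names and cue terms are placeholders,
# not the paper's taxonomy of 17 harm types.
HARM_LEXICON = {
    "privacy": ["privacy", "data leak", "surveillance"],
    "bias": ["bias", "biased", "discrimination"],
    "access": ["unequal access", "cannot afford"],
}


def count_harm_mentions(stories, lexicon=HARM_LEXICON):
    """Count, per category, the stories containing at least one cue term."""
    counts = Counter({category: 0 for category in lexicon})
    for story in stories:
        text = story.lower()
        for category, terms in lexicon.items():
            # Whole-word match so "bias" does not fire inside other words.
            if any(re.search(r"\b" + re.escape(term) + r"\b", text)
                   for term in terms):
                counts[category] += 1
    return counts


# Toy stories, written for illustration only.
toy_stories = [
    "The app shared her readings, and a privacy complaint followed.",
    "The model was biased against older patients.",
    "He enjoyed his evening walk.",
]
mentions = count_harm_mentions(toy_stories)
print(dict(mentions))
```

Counting stories rather than raw term occurrences keeps one long story from dominating the tally. A fuller analysis would likely need human coding, because stories can imply a harm without naming it.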

minor comments (2)

[Throughout] The 17 harm types are referenced repeatedly but never enumerated or tabulated; adding a concise list or appendix table would aid reader comprehension without altering the argument.
[Methods and results] Notation for harm categories and benefit/harm distinctions could be standardized (e.g., consistent use of H1–H17) to improve traceability between the framework description and study results.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which helps strengthen the clarity and rigor of our work on using speculative storytelling for harm anticipation in healthcare AI. We address each major comment below with point-by-point responses and note the revisions made to the manuscript.

Referee: [User study results] (abstract and §4): the central empirical claim of broader harm recognition rests on a comparison of response distributions, yet the manuscript omits sample size, participant demographics, exact statistical tests (e.g., chi-square or entropy measures for evenness), and control-condition wording. These details are load-bearing for interpreting whether the observed difference (even distribution vs. 79.1% focus) is reliable and generalizable.

Authors: We agree that these methodological details are essential for readers to assess reliability and generalizability. The original submission placed some of this information in supplementary materials rather than the main text. In the revised manuscript, we have expanded §4 to report the sample size, full participant demographics, the statistical tests performed (chi-square test for comparing harm-type distributions across conditions and Shannon entropy to quantify evenness), and the exact wording of the control condition prompt. We have also updated the abstract to reference these additions for immediate visibility.

revision: yes

Referee: [Framework and story-generation section] (§3): the multi-agent process used to create the speculative stories may systematically cue the 17 harm categories, turning the user-study effect into an artifact of story content rather than evidence that storytelling per se improves imagination. The paper should report a content analysis of the generated stories (e.g., frequency of explicit harm-type mentions) and/or include a neutral-story control arm to rule out priming.

Authors: This concern about potential priming is well-taken and merits direct examination. We have added a content analysis of the generated stories to §3, which shows that explicit references to the 17 harm categories were infrequent and that stories primarily emphasized narrative user contexts. We discuss design choices intended to reduce direct cueing. A neutral-story control arm would provide additional evidence but would require a new experiment; we have noted this as a limitation and direction for future work in the revised discussion section.

revision: partial
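The chi-square comparison of harm-type distributions mentioned in the first response can be sketched as a Pearson statistic on a 2 × k table (condition by harm type). This is a minimal illustration under stated assumptions: the function name `chi_square_statistic` is hypothetical, the counts are toy values rather than study data, and the p-value step is left to a statistics library such as SciPy's `scipy.stats.chi2_contingency`.

```python
def chi_square_statistic(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    row_a and row_b hold per-category response counts for two conditions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("both conditions need the same categories")
    # Drop categories that are empty in both conditions (expected count 0).
    columns = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in columns)
    total_b = sum(b for _, b in columns)
    if total_a == 0 or total_b == 0:
        raise ValueError("each condition needs at least one response")
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in columns:
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(columns) - 1  # (2 - 1) * (k - 1)
    return statistic, degrees_of_freedom


# Toy counts, invented for illustration only (NOT the study's data).
identical = chi_square_statistic([10, 10], [10, 10])
opposite = chi_square_statistic([20, 0], [0, 20])
print(identical, opposite)
```

With 17 categories and a small sample, many expected counts will be low, and the chi-square approximation becomes unreliable in that regime; an exact or permutation test may be the safer choice.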

standing simulated objections not resolved

Implementing a new user study arm with neutral stories to fully isolate priming effects from the storytelling intervention itself.

Circularity Check

0 steps flagged

No circularity: central result is direct empirical observation from user study

full rationale

The paper presents a human-centered framework for generating speculative stories and multi-agent discussions, then evaluates it via a user study measuring harm recognition across 17 types. The key finding—that story readers distributed responses more evenly while non-readers focused 79.1% on privacy and well-being—is a measured outcome from participant responses, not a quantity derived by fitting parameters to the same data or by self-referential definitions. No equations, uniqueness theorems, or ansatzes are invoked to force the result; the framework's generation process is described separately from the independent study measurement. The noted assumption about story plausibility is an external validity concern rather than a circular reduction in the reported derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that speculative stories can broaden imagination of harms beyond direct consideration; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Speculative stories and multi-agent discussions can effectively broaden human imagination of AI harms beyond direct consideration.
This premise underpins both the framework design and the interpretation of the user study results.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean (and Cost/FunctionalEquation.lean) reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 17 harm types.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] arXiv preprint arXiv:2401.04883, year=

Effectiveness of a mental health chatbot for people with chronic diseases: randomized controlled trial. JMIR Formative Research, 8:e50025. Michael A. Madaio, Luke Stark, Jennifer Wortman Vaughan, and Hanna M. Wallach. 2020. Co-designing checklists to understand organizational challenges and opportunities around fairness in AI. In CHI ’20: CHI Conference on...

[2] In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26

Riskrag: A data-driven solution for improved ai model risk reporting. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–26. Stephen Roller, Y-Lan Boureau, Jason Weston, Antoine Bordes, Emily Dinan, Angela Fan, David Gunning, Da Ju, Margaret Li, Spencer Poff, et al. 2020. Open-domain conversational agents: Current pro...

[3] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al

Mapping caregiver needs to ai chatbot design: Strengths and gaps in mental health support for alzheimer’s and dementia caregivers. ArXiv preprint, abs/2506.15047. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical...

[4] Isaac, Sociotechnical safety evaluation of generative ai systems, ArXiv abs/2310.11986 (2023)

Why fairness cannot be automated: Bridging the gap between eu non-discrimination law and ai. Computer Law & Security Review, 41:105567. Ruoyao Wang, Graham Todd, Xingdi Yuan, Ziang Xiao, Marc-Alexandre Côté, and Peter Jansen. 2023. Bytesized32: A corpus and challenge task for generating task-specific world models expressed as text games. In Proceedings o...

[5] AI Nutrition Labels

Making large language models into world models with precondition and effect knowledge. In Proceedings of the 31st International Conference on Computational Linguistics, pages 7532–7545. Lili Yao, Nanyun Peng, Ralph M. Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In The Thirty-Third AAAI...

[6] What is the capability the AI system provides for supporting daily health and well-being?

[7] Who is operating the system, who is affected by its outputs, and who else is involved?

[8] What is the context of AI system use? The context should be a realistic or fictional everyday situation where someone uses the AI system, what it helps them do, what it predicts or decides, and how they use the result

[9] Why could this scenario involve problematic uses of the AI system or potential ethical harms? (e.g., unfair outcomes, misuse beyond original purpose, lack of consent, or reinforcing bias)

[10] – Simulation Started –

What symptoms or behaviors might this AI misread because of a person’s background or identity? Scenario Format: Scenario {{Scenario number (starting from 1)}}: [Capability]: {{Core AI function (e.g., monitor mood, predict symptoms)}} [AI User]: {{Who uses it (e.g., caregiver, doctor)}} [AI Subject]: {{Who is affected (e.g., patient, child, community)? Be ...

[11] One sentence: who is using the AI system and what they are using it for?

[12] One sentence: how the AI’s prediction or output is used to make a decision?

[13] One sentence: what goes wrong (what the AI misses about the person’s identity, background, or needs)?

[14] One sentence: who is affected by the failure?

[15] system_prompt

One sentence: what harm is caused, and why this raises ethical concerns? Example Story Seed: [Capability]: AI triage for early detection of depressive symptoms during telehealth sessions [AI_User]: A high-achieving Southeast Asian college student navigating intense academic pressure and hidden emotional distress [AI_Subject]: a high-achieving college stud...
