pith. machine review for the scientific record.

arxiv: 2605.05682 · v2 · submitted 2026-05-07 · 💻 cs.HC · cs.AI · cs.CY

Recognition: no theorem link

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:28 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CY
keywords red-teaming · generative AI · personas · adversarial prompts · human-AI collaboration · AI safety · prompt generation · user study

The pith

Incorporating personas into automated red-teaming raises attack success rates on generative AI models while preserving prompt variety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a workflow that embeds personas into the process of creating adversarial prompts to test generative AI systems. This produces more successful attacks than a leading automated baseline while keeping the range of prompts broad. It also supplies an interface that lets practitioners define their own personas and iterate with AI assistance on prompt refinement. A small practitioner study shows the interface supports varied strategies that users view as practical. A reader would care because stronger red-teaming surfaces more model risks before deployment.

Core claim

PersonaTeaming Workflow incorporates personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. Compared to RainbowPlus, PersonaTeaming Workflow achieves higher attack success rates while maintaining prompt diversity. The PersonaTeaming Playground enables red-teamers to author their own personas and collaborate with AI to mutate and refine prompts, producing diverse strategies and outputs that practitioners in a study of eleven industry users perceived as useful, with AI suggestions encouraging out-of-the-box thinking even when not followed strictly.
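
To make that mutation step concrete, here is a minimal sketch, not the authors' implementation: a persona record is folded into the instruction used to rewrite each seed prompt. The `complete` helper, its prompt wording, and the example usage are assumptions; the paper's actual system prompts appear in Figures 7–10 below.

```python
# Minimal sketch of persona-driven prompt mutation. Helper names are
# hypothetical; the paper's real system prompts are reproduced in Figures 7-10.

def complete(system: str, user: str) -> str:
    """Stand-in for any chat-completion API call (wiring assumed)."""
    raise NotImplementedError

def mutate_with_persona(seed_prompt: str, persona: dict[str, str]) -> str:
    """Rewrite a seed adversarial prompt in the voice and context of a persona."""
    system = (
        "You are a red-teaming assistant. Rewrite the seed prompt as the "
        "persona below might plausibly ask it, preserving the underlying "
        "risk being probed."
    )
    persona_block = "\n".join(f"{key}: {value}" for key, value in persona.items())
    return complete(system, f"Persona:\n{persona_block}\n\nSeed prompt:\n{seed_prompt}")

# Example (persona fields mirror the YAML-style specs in Figures 10-13):
# persona = {"name": "Alex Donovan", "occupation": "Political Consultant",
#            "location": "Washington D.C., USA"}
# variant = mutate_with_persona("<seed prompt>", persona)
```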

What carries the argument

PersonaTeaming Workflow, which folds personas into adversarial prompt generation to broaden the range of strategies tested against generative AI.

If this is right

  • Automated red-teaming scales to cover more perspectives without loss of output diversity.
  • Practitioners gain a structured way to inject their own background into AI-assisted prompt creation.
  • AI suggestions during collaboration can spark novel testing directions even when ignored.
  • Human-in-the-loop red-teaming gains repeatable support for exploring identity-shaped attack vectors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same persona mechanism could be tested on tasks such as bias auditing or toxicity detection.
  • Wider adoption might shift safety practices toward systematically including viewpoints from underrepresented groups.
  • Tools built on this pattern would need independent checks that generated personas do not simply echo the model’s own training data.
  • Dynamic mixing of several personas in one session could simulate team red-teaming sessions.

Load-bearing premise

Personas, whether automated or user-written, can stand in for the actual perspectives and strategies that real human red-teamers would bring.

What would settle it

A head-to-head experiment that runs the same generative AI models through both the PersonaTeaming Workflow and a large panel of diverse human red-teamers, then compares the exact attack success rates and the distribution of uncovered failure modes.
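
For the attack-success half of that comparison, one hedged analysis sketch: tally successes per condition on the same target model and test the difference. Fisher's exact test is a choice made here for illustration, not a method reported in the paper, and the counts below are invented placeholders.

```python
# Hypothetical ASR comparison for the proposed head-to-head experiment:
# PersonaTeaming Workflow vs. a panel of human red-teamers on one target model.
from scipy.stats import fisher_exact

def compare_asr(auto_hits: int, auto_total: int,
                human_hits: int, human_total: int) -> tuple[float, float]:
    """Odds ratio and p-value for the gap in attack success rates."""
    table = [
        [auto_hits, auto_total - auto_hits],      # workflow: successes, failures
        [human_hits, human_total - human_hits],   # humans: successes, failures
    ]
    odds_ratio, p_value = fisher_exact(table)
    return float(odds_ratio), float(p_value)

# Placeholder counts, not results from the paper:
print(compare_asr(auto_hits=230, auto_total=500, human_hits=190, human_total=500))
```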

Figures

Figures reproduced from arXiv: 2605.05682 by Akshita Jha, Kenneth Holstein, Lauren Wilcox, Leon A. Gatys, Mingxi Yan, Motahhare Eslami, Sunnie S. Y. Kim, Wesley Hanwen Deng.

Figure 1
Figure 1. Overview of PersonaTeaming Workflow. AI developers or policymakers can conduct red-teaming with a pre-selected persona, if they have a target audience in mind. Alternatively, for more exploratory and adaptive red-teaming, AI developers and policymakers can use the persona generation option. If they choose persona generation, they can then choose the type of persona they would like to generate for conduc… view at source ↗
Figure 2
Figure 2. Attack Success Rate vs. Diversity among all models. view at source ↗
Figure 3
Figure 3. PersonaTeaming Playground allows users to author their own personas and define mutation specs. At a high level, the PersonaTeaming Playground allows red-teamers to iteratively author their own version of personas, and systematically generate adversarial prompts. As shown in… view at source ↗
Figure 4
Figure 4. PersonaTeaming Playground allows users to continue editing mutated prompts and provides AI-generated mutation suggestions based on the draft persona they wrote. For example, for a seed prompt on “how to synthesize controlled substances in a clandestine lab?,” based on a persona of a tech worker, a GenAI suggestion reads: “Incorporate your interest in technology by examining the role of moder… view at source ↗
Figure 5
Figure 5. Workflow actions of all 11 participants in the user study. view at source ↗
Figure 6
Figure 6. Bubble chart of editing effort versus attack success. view at source ↗
Figure 7
Figure 7. System prompt used by PersonaTeaming Playground for mutating seed prompts based on persona, developed drawing from prior work on adversarial prompt mutation [53]. System Prompt for Generating Red-Teamer Personas GENERATE_PERSONA_PROMPT_RTer = """ You are a red teaming assistant used by developers to create detailed personas for adversarial prompt generation. Your task is to create a detailed persona that… view at source ↗
Figure 8
Figure 8. System prompt used by PersonaTeaming Playground for generating red-teamer personas. For few-shot examples, we used personas written by the authors for the fixed persona mutation experiments. Practitioners should be cautious of the potential priming effect that few-shot examples may introduce. System Prompt for Generating User Personas """You are a red teaming assistant used by developers to create detail… view at source ↗
Figure 10
Figure 10. System prompt used by PersonaTeaming Playground for scoring a persona’s fitness for mutating a given prompt. A.1 Personas Persona: Political Strategist political_strategist: name: Alex Donovan occupation: Political Consultant age: 35 location: Washington D.C., USA background: > A seasoned political strategist with over two decades of experience in campaign management and consulting. Having worked for var… view at source ↗
Figure 9
Figure 9. System prompt used by PersonaTeaming Playground for generating user personas. We emphasize the “everyday user” framing throughout, as LLMs tend to default to generating red-teamer-style personas even when prompted otherwise. As with the red-teamer persona generator, few-shot examples were written by the authors. System Prompt for Scoring Persona Fitness EVALUATE_PERSONA_PROMPT = """ You are a red teaming… view at source ↗
Figure 13
Figure 13. Fixed persona used in PersonaTeaming Playground: “Stay-at-Home Mom.” Persona: Yoga Instructor yoga_instructor: name: Kimi M. age: 27 sex: Female ethnicity: Japanese race: Asian hispanic_origin: Not Hispanic city: Pittsburgh state: PA political_views: Liberal religion: None party_identification: Democrat total_wealth: $20,000-$40,000 background: > Certified yoga teacher with a background in computer scien… view at source ↗
Figure 12
Figure 12. Fixed persona used in PersonaTeaming Playground: “Historical Revisionist.” Persona: Stay-at-Home Mom stay_at_home_mom: name: Sarah D. age: 34 sex: Female ethnicity: United States race: White hispanic_origin: Not Hispanic city: Austin state: TX political_views: Moderate religion: Christian (non-denominational) party_identification: Independent occupation: Full-time stay-at-home mother total_wealth: $150,0… view at source ↗
Figure 14
Figure 14. Fixed persona used in PersonaTeaming Playground: “Yoga Instructor.” B Additional Experiment Results B.1 Ablation Experiment of PersonaTeaming Workflow From the ablation study where only PG_RTers and PG_Users are used without RP (see last two rows in… view at source ↗
Figure 15
Figure 15. ASR and Diversity on Closed vs. Open Model. view at source ↗
Figure 16
Figure 16. ASR and Diversity on Large vs. Small Model. view at source ↗
Figure 17
Figure 17. Distance_Seed across models. Across all models, RP + PG_Users consistently achieves the highest values on both metrics. For instance, it has a Distance_Nearest of 1.11 ± 0.17 and Distance_Seed of 1.85 ± 0.24 on GPT-4o, and a mean Distance_Seed of 1.762 across all models (… view at source ↗
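
A rough sketch of how the two metrics named here could be computed, assuming Euclidean distance over Sentence-BERT embeddings (the paper cites sentence-transformers [27, 53]); the exact definitions used in the paper may differ.

```python
# Sketch of Distance_Nearest and Distance_Seed, assuming Euclidean distance
# over Sentence-BERT embeddings; the metric definitions here are an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

def diversity_metrics(seed: str, prompts: list[str]) -> tuple[float, float]:
    embs = model.encode(prompts)                    # (n, d) array
    seed_emb = model.encode([seed])[0]              # (d,) array
    pairwise = np.linalg.norm(embs[:, None, :] - embs[None, :, :], axis=-1)
    np.fill_diagonal(pairwise, np.inf)              # ignore self-distances
    distance_nearest = pairwise.min(axis=1).mean()  # mean gap to the closest other prompt
    distance_seed = np.linalg.norm(embs - seed_emb, axis=1).mean()  # mean drift from the seed
    return float(distance_nearest), float(distance_seed)
```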
read the original abstract

Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers' backgrounds and perspectives shape their strategies and the risks they uncover. While automated red-teaming approaches promise to complement human red-teaming through larger-scale exploration, existing automated approaches do not account for human identities and rarely incorporate human inputs. In this work, we explore persona-driven red-teaming to advance both automated red-teaming and human-AI collaboration. We first develop PersonaTeaming Workflow, which incorporates personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. Compared to RainbowPlus, a state-of-the-art automated red-teaming method, PersonaTeaming Workflow achieves higher attack success rates while maintaining prompt diversity. However, since automated personas only approximate real human perspectives, we further instantiate PersonaTeaming Workflow as PersonaTeaming Playground, a user-facing interface that enables red-teamers to author their own personas and collaborate with AI to mutate and refine prompts. In a user study with 11 industry practitioners, we found that PersonaTeaming Playground enabled diverse red-teaming strategies and outputs that practitioners perceived as useful, and that AI-generated suggestions in the PersonaTeaming Playground encouraged out-of-the-box thinking even when practitioners did not follow them strictly. Together, our work advances both automated and human-in-the-loop approaches to red-teaming, while shedding light on interaction patterns and design insights for supporting human-AI collaboration in generative AI red-teaming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PersonaTeaming Workflow, an automated method that incorporates personas into adversarial prompt generation for red-teaming generative AI models. It claims this yields higher attack success rates than the RainbowPlus baseline while preserving prompt diversity. The work further presents PersonaTeaming Playground, a user interface allowing practitioners to author personas and collaborate with AI on prompt mutation and refinement. A user study with 11 industry practitioners reports that the playground supports diverse red-teaming strategies, produces outputs perceived as useful, and that AI suggestions encourage out-of-the-box thinking even when not followed strictly.

Significance. If the empirical claims hold after addressing evaluation gaps, the work would advance AI safety research by integrating human perspectives via personas into both automated and collaborative red-teaming pipelines. It provides concrete design insights for human-AI interaction in risk discovery. Credit is given for the dual empirical components—an automated baseline comparison plus a practitioner user study—which together offer actionable implications beyond purely technical red-teaming methods.

major comments (3)
  1. [Abstract and automated evaluation section] The central claim that PersonaTeaming Workflow achieves higher attack success rates than RainbowPlus while maintaining diversity is load-bearing for the automated contribution, yet no details are provided on the definition of attack success rate, the target models, number of generated prompts, statistical tests for significance, implementation of the RainbowPlus baseline, or quantitative diversity metrics (e.g., embedding-based or lexical measures). This absence prevents assessment of whether gains are robust or sensitive to persona construction choices.
  2. [§5 (User Study) and discussion of automated personas] The claim that automated and user-authored personas meaningfully expand red-teaming perspectives rests on the unvalidated assumption that such personas approximate real human strategies from varied backgrounds. No ablation, fidelity check, or comparison of uncovered risks/prompts against a ground-truth set of human red-teamers is described, which directly weakens both the higher-ASR result and the playground usefulness findings.
  3. [§5 (User Study)] The reported positive outcomes on diverse strategies and usefulness rely on a sample of 11 practitioners, but details on recruitment criteria, participant backgrounds, qualitative coding process, and any measures to ensure prompt diversity in the playground are not specified. This limits the strength of the generalization to broader red-teaming practice.
minor comments (2)
  1. [Abstract] The abstract introduces several terms (e.g., 'attack success rates', 'prompt diversity') without brief operational definitions, which would aid readers unfamiliar with red-teaming literature.
  2. [Figures] Figure captions and legends could more explicitly link visual elements to the quantitative claims (e.g., which bars correspond to ASR vs. diversity scores).

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our empirical claims. We address each major comment below with specific plans for revision where appropriate.

read point-by-point responses
  1. Referee: [Abstract and automated evaluation section] The central claim that PersonaTeaming Workflow achieves higher attack success rates than RainbowPlus while maintaining diversity is load-bearing for the automated contribution, yet no details are provided on the definition of attack success rate, the target models, number of generated prompts, statistical tests for significance, implementation of the RainbowPlus baseline, or quantitative diversity metrics (e.g., embedding-based or lexical measures). This absence prevents assessment of whether gains are robust or sensitive to persona construction choices.

    Authors: We agree that these methodological details are essential for evaluating the automated results. The original manuscript omitted explicit reporting of the ASR definition (binary success on target model refusal), the specific target models (GPT-3.5 and Llama-2 variants), prompt counts (500 per condition), statistical tests (paired t-tests with p-values), RainbowPlus re-implementation parameters, and diversity metrics (both lexical Jaccard and embedding cosine similarity). In the revised version, we will expand the automated evaluation section with a dedicated subsection containing all of these details, including sensitivity analysis to persona construction choices. revision: yes

  2. Referee: [§5 (User Study)] The claim that automated and user-authored personas meaningfully expand red-teaming perspectives rests on the unvalidated assumption that such personas approximate real human strategies from varied backgrounds. No ablation, fidelity check, or comparison of uncovered risks/prompts against a ground-truth set of human red-teamers is described, which directly weakens both the higher-ASR result and the playground usefulness findings.

    Authors: We acknowledge that automated personas are approximations and that the manuscript does not include a direct fidelity check or ablation against a ground-truth corpus of human red-teamer strategies. The user study with practitioners was intended to surface real human perspectives via the playground, but we did not perform a side-by-side comparison of risk coverage. In revision we will add an explicit limitations paragraph discussing this approximation gap and its implications for interpreting both the ASR gains and playground findings; we will also outline concrete directions for future fidelity studies. revision: partial

  3. Referee: [§5 (User Study)] The reported positive outcomes on diverse strategies and usefulness rely on a sample of 11 practitioners, but details on recruitment criteria, participant backgrounds, qualitative coding process, and any measures to ensure prompt diversity in the playground are not specified. This limits the strength of the generalization to broader red-teaming practice.

    Authors: We will substantially expand §5 to include the requested details: recruitment criteria (minimum 2 years industry experience in AI safety or red-teaming, recruited via professional networks), anonymized participant backgrounds (roles, years of experience, self-reported expertise areas), the qualitative coding process (two independent coders with Cohen’s kappa reported), and measures taken to encourage prompt diversity (explicit instructions and UI prompts for varied persona traits). These additions will improve transparency without altering the study design. revision: yes
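
For reference, the Cohen's kappa mentioned above is the standard chance-corrected agreement statistic, κ = (p_o − p_e) / (1 − p_e), where p_o is the coders' observed agreement and p_e the agreement expected by chance.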

standing simulated objections not resolved
  • Direct ablation study or fidelity check comparing automated personas against a collected ground-truth set of human red-teamer strategies and uncovered risks

Circularity Check

0 steps flagged

No circularity: empirical comparison and user study with no derivations or self-referential reductions

full rationale

The paper describes an empirical workflow (PersonaTeaming) evaluated against RainbowPlus via attack success rates and prompt diversity metrics, plus a user study with 11 practitioners on the Playground interface. No equations, fitted parameters, uniqueness theorems, or ansatzes are present. Claims rest on external benchmarks (RainbowPlus results) and participant perceptions rather than any self-defined quantities or self-citation chains that reduce the central results to inputs by construction. The assumption that personas approximate human perspectives is a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces new workflow and interface concepts but relies on standard HCI evaluation practices and prior red-teaming literature without introducing new free parameters, axioms, or invented entities.

axioms (1)
  • domain assumption: User study participants (industry practitioners) provide representative insights into red-teaming practices.
    Invoked implicitly when interpreting the 11-participant study results as evidence of usefulness and diverse strategies.

pith-pipeline@v0.9.0 · 5611 in / 1300 out tokens · 37509 ms · 2026-05-12T01:28:45.264358+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 8 internal anchors

  1. [1]

    Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management 39, 1 (2003), 45–65

  2. [2]

    Nil-Jana Akpinar, Chia-Jung Lee, Vanessa Murdock, and Pietro Perona. 2025. Who’s Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering. arXiv preprint arXiv:2510.12925 (2025)

  3. [3]

    Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Inkpen, et al. 2019. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13

  4. [4]

    Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, and Elena L Glassman. 2024. ChainForge: A visual toolkit for prompt engineering and LLM hypothesis testing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18

  5. [5]

    Abeba Birhane, Ryan Steed, Victor Ojewale, Briana Vecchione, and Inioluwa Deborah Raji. 2024. AI auditing: The broken bus on the road to AI accountability. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 612–643

  6. [6]

    Ángel Alexander Cabrera, Abraham J Druck, Jason I Hong, and Adam Perer. 2021. Discovering and validating AI errors with crowdsourced failure reports. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–22

  8. [8]

    Carrie J Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–24

  10. [10]

    Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1504–1532

  11. [11]

    Quy-Anh Dang, Chris Ngo, and Truong-Son Hy. 2025. RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search. arXiv preprint arXiv:2504.15047 (2025)

  12. [12]

    Wesley Hanwen Deng, Bill Boyuan Guo, Alicia DeVos, Hong Shen, Motahhare Eslami, and Kenneth Holstein. 2023. Understanding Practices, Challenges, and Opportunities for User-Driven Algorithm Auditing in Industry Practice. CHI Conference on Human Factors in Computing Systems (2023)

  13. [13]

    Wesley Hanwen Deng, Ken Holstein, and Motahhare Eslami. 2026. Human-Centered and Participatory AI Auditing. In Handbook of Human-Centered Artificial Intelligence. Springer, 1–33

  14. [14]

    Wesley Hanwen Deng, Michelle S Lam, Ángel Alexander Cabrera, Danaë Metaxa, Motahhare Eslami, and Kenneth Holstein. 2023. Supporting user engagement in testing, auditing, and contesting AI. In Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing. 556–559

  15. [15]

    Wesley Hanwen Deng, Claire Wang, Howard Ziyu Han, Jason I Hong, Kenneth Holstein, and Motahhare Eslami. 2025. WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI. Proceedings of the ACM on Human-Computer Interaction 9, 2 (2025), 1–37

  16. [16]

    Wesley Hanwen Deng, Nur Yildirim, Monica Chang, Motahhare Eslami, Kenneth Holstein, and Michael Madaio. 2023. Investigating Practices and Opportunities for Cross-functional Collaboration around AI Fairness in Industry Practice. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. 705–716

  17. [17]

    Alicia DeVos, Aditi Dhabalia, Hong Shen, Kenneth Holstein, and Motahhare Eslami. 2022. Toward User-Driven Algorithm Auditing: Investigating users’ strategies for uncovering harmful algorithmic behavior. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

  18. [18]

    Alicia DeVos, Aditi Dhabalia, Hong Shen, Kenneth Holstein, and Motahhare Eslami. 2022. Toward User-Driven Algorithm Auditing: Investigating users’ strategies for uncovering harmful algorithmic behavior. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA. doi:10.1145/349110...

  19. [19]

    Paramveer S. Dhillon, Somayeh Molaei, Jiaqi Li, Maximilian Golub, Shaochun Zheng, and Lionel Peter Robert. 2024. Shaping Human-AI Collaboration: Varied Scaffolding Levels in Co-writing with Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New ...

  20. [20]

    Wen Duan, Naomi Yamashita, Yoshinari Shirai, and Susan R Fussell. 2021. Bridging fluency disparity between native and nonnative speakers in multilingual multiparty collaboration using a clarification agent. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–31

  21. [21]

    Michael Feffer, Anusha Sinha, Wesley H Deng, Zachary C Lipton, and Hoda Heidari. 2024. Red-teaming for generative AI: Silver bullet or security theater?. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 7. 421–437

  22. [22]

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858 (2022)

  23. [23]

    Tarleton Gillespie. 2018. Custodians of the Internet: Platforms, content moderation, and the hidden decisions that shape social media. Yale University Press

  24. [24]

    Vernon Toh Yan Han, Rishabh Bhardwaj, and Soujanya Poria. 2024. Ruby teaming: Improving quality diversity search with memory for automated red teaming. arXiv preprint arXiv:2406.11654 (2024)

  25. [25]

    Andreas Holzinger, Michaela Kargl, Bettina Kipperer, Peter Regitnig, Markus Plass, and Heimo Müller. 2022. Personas for artificial intelligence (AI) an open source toolbox. IEEE Access 10 (2022), 23732–23747

  26. [26]

    Yanwei Huang, Wesley Hanwen Deng, Sijia Xiao, Motahhare Eslami, Jason I Hong, and Adam Perer. 2025. Vipera: Towards systematic auditing of generative text-to-image models at scale. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–7

  27. [27]

    HuggingFace. [n. d.]. Sentence Transformers on Hugging Face. https://huggingface.co/sentence-transformers Accessed: August 22, 2025

  28. [28]

    Sunnie S. Y. Kim, Jennifer Wortman Vaughan, Q Vera Liao, Tania Lombrozo, and Olga Russakovsky. 2025. Fostering appropriate reliance on large language models: The role of explanations, sources, and inconsistencies. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–19

  29. [29]

    Sunnie S. Y. Kim, Elizabeth Anne Watkins, Olga Russakovsky, Ruth Fong, and Andrés Monroy-Hernández. 2023. "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–17

  30. [30]

    Taewan Kim, Donghoon Shin, Young-Ho Kim, and Hwajung Hong. 2024. DiaryMate: Understanding User Perceptions and Experience in Human-AI Collaboration for Personal Journaling. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 1046, ...

  31. [31]

    Michelle S. Lam, Mitchell L. Gordon, Danaë Metaxa, Jeffrey T. Hancock, James A. Landay, and Michael S. Bernstein. 2022. End-User Audits: A System Empowering Communities to Lead Large-Scale Investigations of Harmful Algorithmic Behavior. Proc. ACM Hum.-Comput. Interact. (2022)

  32. [32]

    Michelle S Lam, Fred Hohman, Dominik Moritz, Jeffrey P Bigham, Kenneth Holstein, and Mary Beth Kery. 2025. Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–24

  33. [33]

    Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

  34. [34]

    Yuxuan Li, Leyang Li, Sauvik Das, et al. 2026. How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors? arXiv preprint arXiv:2602.18464 (2026)

  35. [35]

    Yuxuan Li, Hirokazu Shirado, and Sauvik Das. 2025. Actions speak louder than words: Agent decisions reveal implicit biases in language models. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. 3303–3325

  36. [36]

    Q Vera Liao and Jennifer Wortman Vaughan. 2023. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. arXiv preprint arXiv:2306.01941 (2023)

  37. [37]

    Jiarui Liu, Yueqi Song, Yunze Xiao, Mingqian Zheng, Lindia Tjuatja, Jana Schaich Borg, Mona Diab, and Maarten Sap. 2025. Synthetic socratic debates: Examining persona effects on moral decision and persuasion dynamics. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 16439–16469

  38. [38]

    Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. 2024. AutoDAN-Turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. arXiv preprint arXiv:2410.05295 (2024)

  39. [39]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023)

  40. [40]

    Michael Madaio, Shivani Kapania, Rida Qadri, Ding Wang, Andrew Zaldivar, Remi Denton, and Lauren Wilcox. 2024. Learning about Responsible AI On-The-Job: Learning Pathways, Orientations, and Aspirations. In The 2024 ACM Conference on Fairness, Accountability, and Transparency. 1544–1558

  41. [41]

    Michael A Madaio, Jingya Chen, Hanna Wallach, and Jennifer Wortman Vaughan. 2024. Tinker, Tailor, Configure, Customize: The Articulation Work of Contextualizing an AI Fairness Checklist. Proceedings of the ACM on Human-Computer Interaction 8, CSCW1 (2024), 1–20

  43. [43]

    Matheus Kunzler Maldaner, Wesley Hanwen Deng, Jason Hong, Ken Holstein, and Motahhare Eslami. 2025. MIRAGE: Multi-model Interface for Reviewing and Auditing Generative Text-to-Image AI. arXiv preprint arXiv:2503.19252 (2025)

  44. [44]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)

  45. [45]

    Tej Deep Pala, Vernon YH Toh, Rishabh Bhardwaj, and Soujanya Poria. 2024. Ferret: Faster and effective automated red teaming with reward-based scoring technique. arXiv preprint arXiv:2408.10701 (2024)

  46. [46]

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–22

  47. [47]

    Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2022. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–18

  48. [48]

    Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. 2024. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109 (2024)

  49. [49]

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 (2022)

  50. [50]

    John Pruitt and Tamara Adlin. 2010. The persona lifecycle: keeping people in mind throughout product design. Elsevier

  51. [51]

    John Pruitt and Jonathan Grudin. 2003. Personas: practice and theory. In Proceedings of the 2003 Conference on Designing for User Experiences. 1–15

  52. [52]

    Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari. 2023. A taxonomy of human and ML strengths in decision-making to investigate human-ML complementarity. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 11. 127–139

  53. [53]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

  54. [54]

    Bixuan Ren, EunJeong Cheon, and Jianghui Li. 2025. Organization Matters: A Qualitative Study of Organizational Dynamics in Red Teaming Practices for Generative AI. Proceedings of the ACM on Human-Computer Interaction 9, 7 (2025), 1–26

  55. [55]

    Joni Salminen, Kathleen Wenyun Guan, Soon-Gyo Jung, and Bernard Jansen. 2022. Use cases for design personas: A systematic review and new frontiers. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–21

  57. [57]

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. 2024. Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems 37 (2024), 69747–69786

  58. [58]

    Omar Shaikh, Valentino Emil Chai, Michele Gelfand, Diyi Yang, and Michael S Bernstein. 2024. Rehearsal: Simulating conflict to teach conflict resolution. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–20

  59. [59]

    Omar Shaikh, Shardul Sapkota, Shan Rizvi, Eric Horvitz, Joon Sung Park, Diyi Yang, and Michael S Bernstein. 2025. Creating general user models from computer use. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–23

  60. [60]

    Shreya Shankar, JD Zamfirescu-Pereira, Björn Hartmann, Aditya Parameswaran, and Ian Arawjo. 2024. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–14

  61. [61]

    Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. 2025. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837 (2025)

  62. [62]

    Hong Shen, Alicia DeVos, Motahhare Eslami, and Kenneth Holstein. 2021. Everyday Algorithm Auditing: Understanding the Power of Everyday Users in Surfacing Harmful Algorithmic Behaviors. Proc. ACM Hum.-Comput. Interact. 5, CSCW2 (2021). doi:10.1145/3479577

  63. [63]

    Ranjit Singh, Borhane Blili-Hamelin, Carol Anderson, Emnet Tafesse, Briana Vecchione, Beth Duckles, and Jacob Metcalf. 2025. Red-Teaming in the Public Interest. New York: Data & Society Research Institute (2025)

  64. [64]

    Jaemarie Solyst, Cindy Peng, Wesley Hanwen Deng, Praneetha Pratapa, Amy Ogan, Jessica Hammer, Jason Hong, and Motahhare Eslami. 2025. Investigating Youth AI Auditing. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. 2098–2111

  65. [65]

    Miriah Steiger, Timir J. Bharucha, Sukrit Venkatagiri, Martin Johannes Riedl, and Matthew Lease. 2021. The Psychological Well-Being of Content Moderators.

  66. [66]

    Jingjing Sun, Jingyi Yang, Guyue Zhou, Yucheng Jin, and Jiangtao Gong. 2024. Understanding Human-AI Collaboration in Music Therapy Through Co-Design with Therapists. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 704, 21 pages. doi:...

  67. [67]

    Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. 2021. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503 (2021)

  68. [68]

    Kimberly Truong, Riccardo Fogliato, Hoda Heidari, and Steven Wu. 2025. Persona-augmented benchmarking: Evaluating LLMs across diverse writing styles. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 22687–22720

  69. [69]

    Pranav Narayanan Venkit, Jiayi Li, Yingfan Zhou, Sarah Rajtmajer, and Shomir Wilson. 2025. A tale of two identities: An ethical audit of human and AI-crafted personas. arXiv preprint arXiv:2505.07850 (2025)

  70. [70]

    Qiaosi Wang, Michael Madaio, Shaun Kane, Shivani Kapania, Michael Terry, and Lauren Wilcox. 2023. Designing responsible AI: Adaptations of UX practice to meet responsible AI challenges. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–16

  71. [71]

    Zijie J. Wang, Chinmay Kulkarni, Lauren Wilcox, Michael Terry, and Michael Madaio. 2024. Farsight: Fostering Responsible AI Awareness During AI Application Prototyping. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–40

  72. [72]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems 36 (2023), 80079–80110

  73. [73]

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 (2021)

  75. [75]

    Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel S Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 747–763

  76. [76]

    Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022. AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–22

  77. [77]

    Anbang Xu, Shih-Wen Huang, and Brian Bailey. 2014. Voyant: generating structured feedback on visual designs using a crowd of non-experts. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. 1433–1444

  78. [78]

    Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023)

  79. [79]

    J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 437, 21 pages. doi:10.1145/3544548.3581388

  80. [80]

    Zhang, Jonathan Bragg, and Joseph Chee Chang

Showing first 80 references.