Ethics and Social Responsibility in AI-Assisted Interviewing: An LLM-in-the-Loop Study of AI-Generated Follow-Up Questions

He Zhang; Jie Cai; John M. Carroll; Xin Guan; Yueyan Liu

arXiv: 2606.30980 · v1 · pith:BK7LQDVRnew · submitted 2026-06-29 · 💻 cs.HC

Pith reviewed 2026-07-01 00:53 UTC · model grok-4.3

keywords: AI ethics · LLM-assisted interviewing · Wizard-of-Oz study · follow-up questions · social responsibility · privacy risks · qualitative methods · discriminatory language

The pith

A Wizard-of-Oz study finds five interlocking ethical concerns when LLMs generate follow-up questions during live interviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a simulated AI follow-up assistant, powered by GPT-4o and mediated by a human co-interviewer, surfaces five specific worries among 17 interviewers of varying expertise. These worries center on harmful language, diminished respect for interviewees, unequal participation, unclear accountability, and privacy exposures. A sympathetic reader would care because semi-structured interviews are widely used in research, hiring, and social services, where AI tools are increasingly proposed to reduce cognitive load. The study translates the concerns directly into recommendations for design choices and governance structures that could make such systems safer.

Core claim

In an LLM-in-the-loop Wizard-of-Oz study, a human co-interviewer selectively relayed and could edit real-time AI-generated follow-up questions produced by GPT-4o. Across 17 participants, five interlocking concerns emerged: (1) harmful or discriminatory language and unpredictable interaction harms, (2) undermining interviewees' sense of respect through divided attention and missing nonverbal cues, (3) technology-based participation inequality, (4) unclear responsibility when harms occur, and (5) privacy, disclosure, and compliance risks when AI listens, records, or transcribes sensitive content. The authors translate these concerns into design and governance implications for safer, more respectful, and more accountable AI-assisted interviewing.

What carries the argument

The selective-relay Wizard-of-Oz setup in which a human co-interviewer edits and relays GPT-4o-generated questions during live interviews while preserving human oversight.

If this is right

Systems must incorporate safeguards against discriminatory or harmful language in generated questions.
Designs should minimize interviewer distraction to preserve respect and attention toward interviewees.
Clear accountability protocols are required to assign responsibility when AI outputs cause harm.
Privacy and compliance features must handle sensitive content when AI records or transcribes interviews.
Solutions for technology access are needed to avoid creating participation inequality among interviewees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same five concerns could appear in other AI-mediated qualitative data collection settings such as focus groups or oral histories.
A side-by-side comparison of fully automated versus human-mediated versions would test whether the observed concerns are artifacts of the Wizard-of-Oz mediation.
Policy makers could use these concerns as a starting list when drafting domain-specific guidelines for AI tools in human-subject research.
Training programs for interviewers might reduce some risks by teaching effective collaboration with AI assistants.

Load-bearing premise

The selective-relay Wizard-of-Oz setup with a human co-interviewer editing AI-generated questions produces concerns that accurately reflect those that would arise in actual deployed AI-assisted interviewing systems without full automation.

What would settle it

Deploy a fully automated version of the same AI follow-up system with no human editing or relay step, then measure whether participants still report the same five concerns in the same proportions.

Figures

Figures reproduced from arXiv: 2606.30980 by He Zhang, Jie Cai, John M. Carroll, Xin Guan, Yueyan Liu.

**Figure 1.** LLM-in-the-loop Wizard-of-Oz study design for AI-assisted qualitative interviewing (study schematic).

Original abstract

Semi-structured interviews rely on timely, context-sensitive follow-up questions, yet interviewers' cognitive load and limited domain familiarity can constrain probing depth. We report findings from an LLM-in-the-loop Wizard-of-Oz (WoZ) study that simulates an AI follow-up assistant in live interviewing while preserving human oversight. In our setup, a co-interviewer selectively relayed and could edit AI-generated follow-up questions (AGQs) produced in real time by GPT-4o, enabling a realistic approximation of deployment without fully automating the interaction. Across 17 interviewers with varied qualitative-method expertise, participants raised five interlocking concerns: (1) harmful or discriminatory language and unpredictable interaction harms, (2) undermining interviewees' sense of respect through divided attention and missing nonverbal cues, (3) technology-based participation inequality, (4) unclear responsibility when harms occur, and (5) privacy, disclosure, and compliance risks when AI listens, records, or transcribes sensitive content. We translate these concerns into design and governance implications for safer, more respectful, and more accountable AI-assisted interviewing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The WoZ setup with human editing of AI questions likely understates the harmful-language risks for real automated deployments.

The main point for you is that this paper's Wizard-of-Oz design, where a human co-interviewer can edit or drop GPT-4o follow-up questions before they reach the interviewee, probably filters out some of the worst outputs. That makes the reported concerns, especially around discriminatory language, look milder than they would in a fully automated system.

What the work actually does is run live interviews with 17 people of varying qualitative experience and collect their reactions to the AI-assisted probing. It surfaces five linked issues: unpredictable harmful phrasing, divided attention that can feel disrespectful, tech-driven inequality in participation, fuzzy accountability for problems, and privacy exposure from AI listening or transcribing. The authors then map those to some design and governance suggestions.

The empirical angle on this narrow use case is new enough to note, and the study gives concrete participant voices rather than just abstract ethics talk. That part is useful.

The soft spot is the missing data on how often the human actually edited or rejected questions. Without those counts, it's hard to judge how representative the observed harms are. The abstract also leaves out recruitment, coding, and validation steps, which makes it tougher to gauge robustness right away.

This is for HCI researchers or practitioners building or evaluating AI tools for interviews, hiring, or similar conversational work. A reader focused on responsible deployment in qualitative methods would find it relevant.

I'd send it to peer review. The topic is timely and the empirical piece is direct, even if the WoZ limitation and methods transparency need work.

Referee Report

2 major / 1 minor

Summary. The manuscript reports findings from an LLM-in-the-loop Wizard-of-Oz study simulating AI-assisted semi-structured interviewing. Using GPT-4o to generate follow-up questions in real time, with a human co-interviewer selectively relaying and editing them, the study with 17 interviewers of varying expertise identifies five interlocking participant concerns—harmful language, undermined respect, participation inequality, unclear responsibility, and privacy risks—and translates these into design and governance implications.

Significance. If the concerns are robustly supported and representative of deployed systems, the work offers timely empirical grounding for ethical considerations in AI tools for qualitative interviewing and human-computer interaction, highlighting risks that could inform safer system design and policy.

major comments (2)

[Study Setup (Abstract)] The selective-relay WoZ design permits the co-interviewer to edit or reject AI-generated questions before relay. This filtering mechanism may reduce the observed incidence of harmful or discriminatory language (concern 1) compared to fully automated deployment. The manuscript does not report the fraction of AGQs edited or rejected, leaving unquantified how well the observed concerns map to real AI-assisted systems without human mediation.
[Methods] No details are provided on participant recruitment, interview protocol, data analysis method, or the process for coding and validating the five concerns. This absence limits evaluation of whether the reported concerns are robustly derived from the data.
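
The edited-or-rejected fraction requested in the first major comment could be tallied along the following lines. This is a minimal sketch only: it assumes a per-question relay log, which the manuscript does not report collecting, and the outcome labels, the function name `summarize_relay_log`, and the sample entries are all hypothetical illustrations, not data from the study.

```python
from collections import Counter
from enum import Enum


class RelayOutcome(Enum):
    """Hypothetical labels for what the co-interviewer did with one AGQ."""
    RELAYED_VERBATIM = "relayed_verbatim"
    EDITED = "edited"
    REJECTED = "rejected"


def summarize_relay_log(outcomes):
    """Return the share of AGQs per outcome as fractions in [0, 1]."""
    if not outcomes:
        raise ValueError("relay log is empty")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Counter returns 0 for outcomes that never occurred in the log.
    return {outcome: counts[outcome] / total for outcome in RelayOutcome}


# Made-up log for a single session (illustrative only, not study data).
example_log = [
    RelayOutcome.RELAYED_VERBATIM,
    RelayOutcome.EDITED,
    RelayOutcome.REJECTED,
    RelayOutcome.RELAYED_VERBATIM,
]
summary = summarize_relay_log(example_log)
# Share of AGQs that the human filter changed or withheld.
filtered_share = summary[RelayOutcome.EDITED] + summary[RelayOutcome.REJECTED]
print(f"edited or rejected: {filtered_share:.0%}")
```

Reporting this share per session would give readers a rough sense of how much human filtering stood between GPT-4o's raw output and the interviewee.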

minor comments (1)

[Abstract] Consider adding the total number of interviews conducted or AGQs generated to contextualize the scale of the qualitative findings.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each of the major comments below.

Referee: [Study Setup (Abstract)] The selective-relay WoZ design permits the co-interviewer to edit or reject AI-generated questions before relay. This filtering mechanism may reduce the observed incidence of harmful or discriminatory language (concern 1) compared to fully automated deployment. The manuscript does not report the fraction of AGQs edited or rejected, leaving unquantified how well the observed concerns map to real AI-assisted systems without human mediation.

Authors: We acknowledge the referee's point that the selective-relay WoZ design introduces human mediation, which could potentially mitigate some of the observed concerns, particularly harmful language, relative to a fully automated system. This design was chosen to ethically simulate AI assistance while maintaining human oversight, consistent with the 'LLM-in-the-Loop' framing of the study. Although we did not log the precise fraction of AI-generated questions that were edited or rejected (our data collection prioritized interviewer reflections over quantitative system metrics), we agree that this limits direct mapping to unmediated deployments. In the revised manuscript, we will add a dedicated discussion of this limitation, including how the five concerns might manifest differently without human filtering, and suggest avenues for future fully automated studies under appropriate safeguards.

revision: yes

Referee: [Methods] No details are provided on participant recruitment, interview protocol, data analysis method, or the process for coding and validating the five concerns. This absence limits evaluation of whether the reported concerns are robustly derived from the data.

Authors: We appreciate this feedback and agree that the Methods section requires greater detail for transparency and replicability. The original manuscript includes descriptions of these elements, but they may not have been sufficiently prominent or detailed. We will revise the Methods section to explicitly cover: (1) participant recruitment via university networks and professional lists, yielding 17 interviewers with diverse expertise levels; (2) the semi-structured interview protocol, including topics discussed; (3) the data analysis approach using reflexive thematic analysis; and (4) the coding and validation process for the five concerns, involving multiple coders and consensus-building. These revisions will ensure the robustness of our findings is clear.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical qualitative study with no derivations or self-referential predictions

The paper reports participant concerns from a Wizard-of-Oz study with 17 interviewers; its claims rest on transcribed feedback rather than any equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support a central result. The setup assumption (human editing approximates deployment) is an acknowledged methodological limit but does not create definitional or statistical circularity in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical qualitative study and introduces no free parameters, mathematical axioms, or invented entities. It rests on standard domain assumptions about the validity of self-reported concerns in simulated technology-use scenarios.

axioms (1)

Domain assumption: Participant self-reports in a simulated AI-assisted interview setting can surface meaningful and generalizable ethical concerns about real deployment.
The study treats the five concerns as actionable for design and governance without additional validation data.
