"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations
Pith reviewed 2026-05-24 04:49 UTC · model grok-4.3
The pith
A dataset of user conversations with GPT-4 agents for behavior change support is released, containing transcripts, language analysis, perception measures, and feedback on AI turns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a dataset gathered in a preregistered study of text conversations between users and two GPT-4 conversational agents built for behavior change support; the dataset comprises the raw dialogue records, linguistic annotations of user messages, user ratings of the agents, and written feedback on individual model turns.
What carries the argument
The dataset collected through the preregistered user study with two GPT-4 agents, which records actual user language, perceptions, and feedback during behavior change exchanges.
If this is right
- System designers can examine the conversation logs to identify common user phrasing that influences how the model responds in support contexts.
- The included language analysis highlights measurable features of user messages that correlate with particular perception scores.
- Feedback comments on individual turns supply direct signals for refining prompt construction or response strategies.
- Perception measures allow comparison of how users rate the two different GPT-4 agents within the same study protocol.
Where Pith is reading between the lines
- The dataset could serve as a benchmark for testing whether new behavior change agents reproduce the same user language patterns observed here.
- Researchers might extend the work by running parallel studies with human counselors to isolate differences unique to LLM agents.
- The feedback data might reveal prompt-level adjustments that reduce user frustration without requiring full retraining of the model.
Load-bearing premise
The recorded interactions are representative enough of everyday behavior change conversations that patterns found in them will apply to the design of future systems.
What would settle it
A controlled test in which behavior change systems built with guidance from this dataset show no measurable improvement in user retention or outcome metrics compared with systems built without the dataset would falsify its claimed utility.
Figures
read the original abstract
Conversational agents are increasingly used to address emotional needs on top of information needs. One use case of increasing interest are counselling-style mental health and behaviour change interventions, with large language model (LLM)-based approaches becoming more popular. Research in this context so far has been largely system-focused, foregoing the aspect of user behaviour and the impact this can have on LLM-generated texts. To address this issue, we share a dataset containing text-based user interactions related to behaviour change with two GPT-4-based conversational agents collected in a preregistered user study. This dataset includes conversation data, user language analysis, perception measures, and user feedback for LLM-generated turns, and can offer valuable insights to inform the design of such systems based on real interactions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a dataset of text-based user interactions with two GPT-4-based conversational agents for behaviour change support, collected via a preregistered user study. It includes conversation logs, user language analysis, perception measures, and feedback on LLM-generated turns, positioned as a resource to inform the design of future LLM-based behaviour change systems.
Significance. The dataset release addresses a noted gap between system-focused LLM research and user behaviour in counselling-style interventions. The preregistered study design is a clear strength that enhances credibility. Real-user data of this form can yield design-relevant observations even from a narrow sample, provided the collection process is fully documented.
major comments (1)
- [Abstract] Abstract: the statement that a preregistered user study was conducted supplies no sample size, inclusion criteria, agent prompts, or validation steps. These details are load-bearing for any claim that the dataset can supply useful signals for system design, as they determine the scope and reliability of the collected interactions.
Simulated Author's Rebuttal
We thank the referee for their review and constructive comment. We address the point raised regarding the abstract below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that a preregistered user study was conducted supplies no sample size, inclusion criteria, agent prompts, or validation steps. These details are load-bearing for any claim that the dataset can supply useful signals for system design, as they determine the scope and reliability of the collected interactions.
Authors: We agree that the abstract would be strengthened by briefly indicating key study parameters to help readers evaluate the dataset's scope. The full manuscript already details the preregistered protocol, sample size, inclusion criteria, agent prompts, and validation steps in the Methods section. In the revised version we will add a concise clause to the abstract (e.g., noting the sample size and that prompts and validation are described in the paper) while remaining within typical length limits. revision: yes
Circularity Check
No circularity: dataset release with no derivations or fitted claims
full rationale
The paper releases conversation logs, language analyses, perception measures, and feedback from a preregistered GPT-4 user study. No equations, parameters, predictions, or derivations appear in the abstract or described content. The central claim—that the dataset can supply design-relevant observations—does not reduce to any self-referential construction, self-citation chain, or fitted input renamed as output. This is a standard data contribution whose value is independent of statistical representativeness or modeling assumptions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Exploring language style in chatbots to increase perceived product value and user engagement
Ela Elsholz, Jon Chamberlain, and Udo Kruschwitz. Exploring language style in chatbots to increase perceived product value and user engagement. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 301–305,
work page 2019
-
[2]
Dataset of Natural Language Queries for E-Commerce
Andrea Papenmeier, Dagmar Kern, Daniel Hienert, Alfred Sliwa, Ahmet Aker, and Norbert Fuhr. Dataset of Natural Language Queries for E-Commerce. In Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, pages 307–311,
work page 2021
-
[3]
“Mhm...”–Conversational Strategies For Product Search Assistants
Andrea Papenmeier, Alexander Frummet, and Dagmar Kern. “Mhm...”–Conversational Strategies For Product Search Assistants. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 36–46,
work page 2022
-
[4]
Conversational agents for recipe recommendation
Sabrina Barko-Sherif, David Elsweiler, and Morgan Harvey. Conversational agents for recipe recommendation. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, pages 73–82,
work page 2020
-
[5]
Step-wise recommendation for complex task support
Elnaz Nouri, Robert Sim, Adam Fourney, and Ryen W White. Step-wise recommendation for complex task support. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, pages 203–212,
work page 2020
-
[6]
Making meaning: A focus for information interactions research
Ian Ruthven. Making meaning: A focus for information interactions research. In Proceedings of the 2019 conference on human information interaction and retrieval, pages 163–171,
work page 2019
-
[7]
Designing Supportive Conversational Agents With and For Teens
Irene Lopatovska and Jessika Davis. Designing Supportive Conversational Agents With and For Teens. In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pages 328–332,
work page 2023
-
[8]
Selina Meyer. “I’m at my wits’ end”-Anticipating Information Needs and Appropriate Support Strategies in Behaviour Change. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 396–399,
work page 2022
-
[9]
URL https://aclanthology.org/2023.eacl-main.53
Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.53. William R Miller and Stephen Rollnick. Motivational interviewing: Helping people change. Guilford press,
work page 2023
-
[10]
Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction
Ashish Sharma, Kevin Rushton, Inna Wanyin Lin, David Wadden, Khendra G Lucas, Adam S Miner, Theresa Nguyen, and Tim Althoff. Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction. arXiv preprint arXiv:2305.02466, 2023a. Siqi Shen, Charles Welch, Rada Mihalcea, and Verónica Pérez-Rosas. Counseling-Style Reflection Generation Usi...
-
[11]
URL https://aclanthology.org/2020.sigdial-1.2
Association for Computational Linguistics. URL https://aclanthology.org/2020.sigdial-1.2. Ashish Sharma, Inna W Lin, Adam S Miner, David C Atkins, and Tim Althoff. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1):46–57, 2023b. Emily M. Bender, Timnit Gebru, Angel...
work page 2020
-
[12]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
Association for Computing Machinery. ISBN 9781450383097. doi:10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922. Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375,
-
[13]
Ayers, Adam Poliak, Mark Dredze, Eric C
ISSN 2168-6106. doi:10.1001/jamainternmed.2023.1838. URL https://jamanetwork.com/journals/jamainternalmedicine/articlepdf/2804309/jamainternal_ ayers_2023_oi_230030_1681999216.70842.pdf. Yanran Li, Ke Li, Hongke Ning, Xiaoqiang Xia, Yalong Guo, Chen Wei, Jianwei Cui, and Bin Wang. Towards an online empathetic chatbot with emotion causes. In Proceedings of...
-
[14]
Most adults report making some changes to their lifestyle for environmen- tal reasons
Office for National Statistics. Most adults report making some changes to their lifestyle for environmen- tal reasons. URL https://www.ons.gov.uk/peoplepopulationandcommunity/wellbeing/articles/ mostadultsreportmakingsomechangestotheirlifestyleforenvironmentalreasons/2023-07-05. MGM Pinho, JD Mackenbach, Hélène Charreire, J-M Oppert, H Bárdos, K Glonti, H...
work page 2023
-
[15]
Selina Meyer. Natural Language Stage of Change Modelling for “Motivationally-driven” Weight Loss Support. In Proceedings of the 2021 International Conference on Multimodal Interaction, pages 807–811,
work page 2021
-
[16]
Yi-Chieh Lee, Naomi Yamashita, Yun Huang, and Wai Fu. " I hear you, I feel you": encouraging deep self-disclosure through a chatbot. In Proceedings of the 2020 CHI conference on human factors in computing systems, pages 1–12,
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.