"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations

David Elsweiler; Selina Meyer

arxiv: 2401.16167 · v2 · submitted 2024-01-29 · 💻 cs.HC · cs.CL

"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations

Selina Meyer , David Elsweiler This is my paper

Pith reviewed 2026-05-24 04:49 UTC · model grok-4.3

classification 💻 cs.HC cs.CL

keywords behavior changeconversational agentsGPT-4datasetuser studyLLMhuman-computer interactionmental health support

0 comments

The pith

A dataset of user conversations with GPT-4 agents for behavior change support is released, containing transcripts, language analysis, perception measures, and feedback on AI turns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects and shares data from a preregistered user study in which participants held text-based conversations with two GPT-4-based agents on behavior change topics. The released materials include full conversation logs, breakdowns of user language, quantitative perception ratings, and direct comments on the quality of the model-generated replies. Earlier work on LLM counseling tools has centered on system performance while leaving aside how actual users speak and steer the exchanges. Access to these real interactions supplies concrete material for examining the interplay between user input and generated support text.

Core claim

The authors present a dataset gathered in a preregistered study of text conversations between users and two GPT-4 conversational agents built for behavior change support; the dataset comprises the raw dialogue records, linguistic annotations of user messages, user ratings of the agents, and written feedback on individual model turns.

What carries the argument

The dataset collected through the preregistered user study with two GPT-4 agents, which records actual user language, perceptions, and feedback during behavior change exchanges.

If this is right

System designers can examine the conversation logs to identify common user phrasing that influences how the model responds in support contexts.
The included language analysis highlights measurable features of user messages that correlate with particular perception scores.
Feedback comments on individual turns supply direct signals for refining prompt construction or response strategies.
Perception measures allow comparison of how users rate the two different GPT-4 agents within the same study protocol.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could serve as a benchmark for testing whether new behavior change agents reproduce the same user language patterns observed here.
Researchers might extend the work by running parallel studies with human counselors to isolate differences unique to LLM agents.
The feedback data might reveal prompt-level adjustments that reduce user frustration without requiring full retraining of the model.

Load-bearing premise

The recorded interactions are representative enough of everyday behavior change conversations that patterns found in them will apply to the design of future systems.

What would settle it

A controlled test in which behavior change systems built with guidance from this dataset show no measurable improvement in user retention or outcome metrics compared with systems built without the dataset would falsify its claimed utility.

Figures

Figures reproduced from arXiv: 2401.16167 by David Elsweiler, Selina Meyer.

read the original abstract

Conversational agents are increasingly used to address emotional needs on top of information needs. One use case of increasing interest are counselling-style mental health and behaviour change interventions, with large language model (LLM)-based approaches becoming more popular. Research in this context so far has been largely system-focused, foregoing the aspect of user behaviour and the impact this can have on LLM-generated texts. To address this issue, we share a dataset containing text-based user interactions related to behaviour change with two GPT-4-based conversational agents collected in a preregistered user study. This dataset includes conversation data, user language analysis, perception measures, and user feedback for LLM-generated turns, and can offer valuable insights to inform the design of such systems based on real interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a dataset release of GPT-4 behavior-change conversations from a preregistered study, useful for filling a data gap but thin on collection details in the abstract.

read the letter

The main point is that the paper releases a dataset of text-based user interactions with two GPT-4 agents on behavior change, collected in a preregistered study. It bundles the conversation logs with language analysis, perception measures, and user feedback on the generated turns. This kind of real interaction data is scarce in the HCI and conversational agent literature, so the release itself is the core contribution. The preregistration and multi-faceted annotations are clear positives that add some structure and potential for reuse. The paper does a straightforward job of stating the motivation and what the dataset contains without overclaiming. The soft spot is that the abstract supplies no sample size, inclusion criteria, agent prompts, or validation steps, which leaves the data-generation process hard to evaluate from the available text. That said, the stress-test note is correct: the claim does not require the sample to be statistically representative of all behavior-change conversations, so even a narrower set of real logs can still yield design-relevant observations. No load-bearing flaws jump out. This paper is for researchers in HCI or LLM-based agents who need actual user data to study safety, effectiveness, or system design in mental-health-adjacent settings. A reader working on evaluation benchmarks or data-driven improvements would get direct value. It shows honest engagement with the literature gap and deserves a serious referee because dataset papers with preregistration can move the field if the data proves usable.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a dataset of text-based user interactions with two GPT-4-based conversational agents for behaviour change support, collected via a preregistered user study. It includes conversation logs, user language analysis, perception measures, and feedback on LLM-generated turns, positioned as a resource to inform the design of future LLM-based behaviour change systems.

Significance. The dataset release addresses a noted gap between system-focused LLM research and user behaviour in counselling-style interventions. The preregistered study design is a clear strength that enhances credibility. Real-user data of this form can yield design-relevant observations even from a narrow sample, provided the collection process is fully documented.

major comments (1)

[Abstract] Abstract: the statement that a preregistered user study was conducted supplies no sample size, inclusion criteria, agent prompts, or validation steps. These details are load-bearing for any claim that the dataset can supply useful signals for system design, as they determine the scope and reliability of the collected interactions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and constructive comment. We address the point raised regarding the abstract below.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that a preregistered user study was conducted supplies no sample size, inclusion criteria, agent prompts, or validation steps. These details are load-bearing for any claim that the dataset can supply useful signals for system design, as they determine the scope and reliability of the collected interactions.

Authors: We agree that the abstract would be strengthened by briefly indicating key study parameters to help readers evaluate the dataset's scope. The full manuscript already details the preregistered protocol, sample size, inclusion criteria, agent prompts, and validation steps in the Methods section. In the revised version we will add a concise clause to the abstract (e.g., noting the sample size and that prompts and validation are described in the paper) while remaining within typical length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or fitted claims

full rationale

The paper releases conversation logs, language analyses, perception measures, and feedback from a preregistered GPT-4 user study. No equations, parameters, predictions, or derivations appear in the abstract or described content. The central claim—that the dataset can supply design-relevant observations—does not reduce to any self-referential construction, self-citation chain, or fitted input renamed as output. This is a standard data contribution whose value is independent of statistical representativeness or modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on the premise that a preregistered user study with GPT-4 agents produced usable interaction data for system design; no free parameters, mathematical axioms, or new invented entities are introduced.

pith-pipeline@v0.9.0 · 5651 in / 1129 out tokens · 25175 ms · 2026-05-24T04:49:26.546710+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Exploring language style in chatbots to increase perceived product value and user engagement

Ela Elsholz, Jon Chamberlain, and Udo Kruschwitz. Exploring language style in chatbots to increase perceived product value and user engagement. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 301–305,

work page 2019
[2]

Dataset of Natural Language Queries for E-Commerce

Andrea Papenmeier, Dagmar Kern, Daniel Hienert, Alfred Sliwa, Ahmet Aker, and Norbert Fuhr. Dataset of Natural Language Queries for E-Commerce. In Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, pages 307–311,

work page 2021
[3]

“Mhm...”–Conversational Strategies For Product Search Assistants

Andrea Papenmeier, Alexander Frummet, and Dagmar Kern. “Mhm...”–Conversational Strategies For Product Search Assistants. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 36–46,

work page 2022
[4]

Conversational agents for recipe recommendation

Sabrina Barko-Sherif, David Elsweiler, and Morgan Harvey. Conversational agents for recipe recommendation. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, pages 73–82,

work page 2020
[5]

Step-wise recommendation for complex task support

Elnaz Nouri, Robert Sim, Adam Fourney, and Ryen W White. Step-wise recommendation for complex task support. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, pages 203–212,

work page 2020
[6]

Making meaning: A focus for information interactions research

Ian Ruthven. Making meaning: A focus for information interactions research. In Proceedings of the 2019 conference on human information interaction and retrieval, pages 163–171,

work page 2019
[7]

Designing Supportive Conversational Agents With and For Teens

Irene Lopatovska and Jessika Davis. Designing Supportive Conversational Agents With and For Teens. In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pages 328–332,

work page 2023
[8]

I’m at my wits’ end

Selina Meyer. “I’m at my wits’ end”-Anticipating Information Needs and Appropriate Support Strategies in Behaviour Change. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 396–399,

work page 2022
[9]

URL https://aclanthology.org/2023.eacl-main.53

Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.53. William R Miller and Stephen Rollnick. Motivational interviewing: Helping people change. Guilford press,

work page 2023
[10]

Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction

Ashish Sharma, Kevin Rushton, Inna Wanyin Lin, David Wadden, Khendra G Lucas, Adam S Miner, Theresa Nguyen, and Tim Althoff. Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction. arXiv preprint arXiv:2305.02466, 2023a. Siqi Shen, Charles Welch, Rada Mihalcea, and Verónica Pérez-Rosas. Counseling-Style Reflection Generation Usi...

work page arXiv
[11]

URL https://aclanthology.org/2020.sigdial-1.2

Association for Computational Linguistics. URL https://aclanthology.org/2020.sigdial-1.2. Ashish Sharma, Inna W Lin, Adam S Miner, David C Atkins, and Tim Althoff. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1):46–57, 2023b. Emily M. Bender, Timnit Gebru, Angel...

work page 2020
[12]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Association for Computing Machinery. ISBN 9781450383097. doi:10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922. Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375,

work page doi:10.1145/3442188.3445922
[13]

Ayers, Adam Poliak, Mark Dredze, Eric C

ISSN 2168-6106. doi:10.1001/jamainternmed.2023.1838. URL https://jamanetwork.com/journals/jamainternalmedicine/articlepdf/2804309/jamainternal_ ayers_2023_oi_230030_1681999216.70842.pdf. Yanran Li, Ke Li, Hongke Ning, Xiaoqiang Xia, Yalong Guo, Chen Wei, Jianwei Cui, and Bin Wang. Towards an online empathetic chatbot with emotion causes. In Proceedings of...

work page doi:10.1001/jamainternmed.2023.1838 2023
[14]

Most adults report making some changes to their lifestyle for environmen- tal reasons

Office for National Statistics. Most adults report making some changes to their lifestyle for environmen- tal reasons. URL https://www.ons.gov.uk/peoplepopulationandcommunity/wellbeing/articles/ mostadultsreportmakingsomechangestotheirlifestyleforenvironmentalreasons/2023-07-05. MGM Pinho, JD Mackenbach, Hélène Charreire, J-M Oppert, H Bárdos, K Glonti, H...

work page 2023
[15]

Motivationally-driven

Selina Meyer. Natural Language Stage of Change Modelling for “Motivationally-driven” Weight Loss Support. In Proceedings of the 2021 International Conference on Multimodal Interaction, pages 807–811,

work page 2021
[16]

I hear you, I feel you

Yi-Chieh Lee, Naomi Yamashita, Yun Huang, and Wai Fu. " I hear you, I feel you": encouraging deep self-disclosure through a chatbot. In Proceedings of the 2020 CHI conference on human factors in computing systems, pages 1–12,

work page 2020

[1] [1]

Exploring language style in chatbots to increase perceived product value and user engagement

Ela Elsholz, Jon Chamberlain, and Udo Kruschwitz. Exploring language style in chatbots to increase perceived product value and user engagement. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 301–305,

work page 2019

[2] [2]

Dataset of Natural Language Queries for E-Commerce

Andrea Papenmeier, Dagmar Kern, Daniel Hienert, Alfred Sliwa, Ahmet Aker, and Norbert Fuhr. Dataset of Natural Language Queries for E-Commerce. In Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, pages 307–311,

work page 2021

[3] [3]

“Mhm...”–Conversational Strategies For Product Search Assistants

Andrea Papenmeier, Alexander Frummet, and Dagmar Kern. “Mhm...”–Conversational Strategies For Product Search Assistants. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 36–46,

work page 2022

[4] [4]

Conversational agents for recipe recommendation

Sabrina Barko-Sherif, David Elsweiler, and Morgan Harvey. Conversational agents for recipe recommendation. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, pages 73–82,

work page 2020

[5] [5]

Step-wise recommendation for complex task support

Elnaz Nouri, Robert Sim, Adam Fourney, and Ryen W White. Step-wise recommendation for complex task support. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, pages 203–212,

work page 2020

[6] [6]

Making meaning: A focus for information interactions research

Ian Ruthven. Making meaning: A focus for information interactions research. In Proceedings of the 2019 conference on human information interaction and retrieval, pages 163–171,

work page 2019

[7] [7]

Designing Supportive Conversational Agents With and For Teens

Irene Lopatovska and Jessika Davis. Designing Supportive Conversational Agents With and For Teens. In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pages 328–332,

work page 2023

[8] [8]

I’m at my wits’ end

Selina Meyer. “I’m at my wits’ end”-Anticipating Information Needs and Appropriate Support Strategies in Behaviour Change. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, pages 396–399,

work page 2022

[9] [9]

URL https://aclanthology.org/2023.eacl-main.53

Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.53. William R Miller and Stephen Rollnick. Motivational interviewing: Helping people change. Guilford press,

work page 2023

[10] [10]

Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction

Ashish Sharma, Kevin Rushton, Inna Wanyin Lin, David Wadden, Khendra G Lucas, Adam S Miner, Theresa Nguyen, and Tim Althoff. Cognitive Reframing of Negative Thoughts through Human-Language Model Interaction. arXiv preprint arXiv:2305.02466, 2023a. Siqi Shen, Charles Welch, Rada Mihalcea, and Verónica Pérez-Rosas. Counseling-Style Reflection Generation Usi...

work page arXiv

[11] [11]

URL https://aclanthology.org/2020.sigdial-1.2

Association for Computational Linguistics. URL https://aclanthology.org/2020.sigdial-1.2. Ashish Sharma, Inna W Lin, Adam S Miner, David C Atkins, and Tim Althoff. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nature Machine Intelligence, 5(1):46–57, 2023b. Emily M. Bender, Timnit Gebru, Angel...

work page 2020

[12] [12]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Association for Computing Machinery. ISBN 9781450383097. doi:10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922. Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375,

work page doi:10.1145/3442188.3445922

[13] [13]

Ayers, Adam Poliak, Mark Dredze, Eric C

ISSN 2168-6106. doi:10.1001/jamainternmed.2023.1838. URL https://jamanetwork.com/journals/jamainternalmedicine/articlepdf/2804309/jamainternal_ ayers_2023_oi_230030_1681999216.70842.pdf. Yanran Li, Ke Li, Hongke Ning, Xiaoqiang Xia, Yalong Guo, Chen Wei, Jianwei Cui, and Bin Wang. Towards an online empathetic chatbot with emotion causes. In Proceedings of...

work page doi:10.1001/jamainternmed.2023.1838 2023

[14] [14]

Most adults report making some changes to their lifestyle for environmen- tal reasons

Office for National Statistics. Most adults report making some changes to their lifestyle for environmen- tal reasons. URL https://www.ons.gov.uk/peoplepopulationandcommunity/wellbeing/articles/ mostadultsreportmakingsomechangestotheirlifestyleforenvironmentalreasons/2023-07-05. MGM Pinho, JD Mackenbach, Hélène Charreire, J-M Oppert, H Bárdos, K Glonti, H...

work page 2023

[15] [15]

Motivationally-driven

Selina Meyer. Natural Language Stage of Change Modelling for “Motivationally-driven” Weight Loss Support. In Proceedings of the 2021 International Conference on Multimodal Interaction, pages 807–811,

work page 2021

[16] [16]

I hear you, I feel you

Yi-Chieh Lee, Naomi Yamashita, Yun Huang, and Wai Fu. " I hear you, I feel you": encouraging deep self-disclosure through a chatbot. In Proceedings of the 2020 CHI conference on human factors in computing systems, pages 1–12,

work page 2020