
arxiv: 2401.16167 · v2 · submitted 2024-01-29 · 💻 cs.HC · cs.CL

"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations

Pith reviewed 2026-05-24 04:49 UTC · model grok-4.3

classification 💻 cs.HC cs.CL
keywords behavior change · conversational agents · GPT-4 · dataset · user study · LLM · human-computer interaction · mental health support

The pith

A dataset of user conversations with GPT-4 agents for behavior change support is released, containing transcripts, language analysis, perception measures, and feedback on AI turns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects and shares data from a preregistered user study in which participants held text-based conversations with two GPT-4-based agents on behavior change topics. The released materials include full conversation logs, breakdowns of user language, quantitative perception ratings, and direct comments on the quality of the model-generated replies. Earlier work on LLM counseling tools has centered on system performance while leaving aside how actual users speak and steer the exchanges. Access to these real interactions supplies concrete material for examining the interplay between user input and generated support text.

Core claim

The authors present a dataset gathered in a preregistered study of text conversations between users and two GPT-4 conversational agents built for behavior change support; the dataset comprises the raw dialogue records, linguistic annotations of user messages, user ratings of the agents, and written feedback on individual model turns.

What carries the argument

The dataset collected through the preregistered user study with two GPT-4 agents, which records actual user language, perceptions, and feedback during behavior change exchanges.

If this is right

  • System designers can examine the conversation logs to identify common user phrasing that influences how the model responds in support contexts.
  • The included language analysis exposes measurable features of user messages that can be tested for correlation with perception scores.
  • Feedback comments on individual turns supply direct signals for refining prompt construction or response strategies.
  • Perception measures allow comparison of how users rate the two different GPT-4 agents within the same study protocol.
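
The agent comparison in the last point can be sketched as follows. This is a minimal illustration, not the authors' analysis: the field names `agent` and `rating`, the labels `agent_a` and `agent_b`, and all rating values are hypothetical placeholders and are not taken from the released dataset. It groups ratings by agent and computes a standardized mean difference (Cohen's d with a pooled standard deviation), one common way to compare two independent groups.

```python
from statistics import mean, stdev

# Hypothetical records: field names, agent labels, and values are
# placeholders for illustration, NOT data from the released dataset.
records = [
    {"agent": "agent_a", "rating": 4.0},
    {"agent": "agent_a", "rating": 5.0},
    {"agent": "agent_a", "rating": 3.0},
    {"agent": "agent_a", "rating": 4.0},
    {"agent": "agent_b", "rating": 3.0},
    {"agent": "agent_b", "rating": 2.0},
    {"agent": "agent_b", "rating": 4.0},
    {"agent": "agent_b", "rating": 3.0},
]


def group_ratings(rows):
    """Collect ratings into a dict keyed by agent label."""
    groups = {}
    for row in rows:
        groups.setdefault(row["agent"], []).append(row["rating"])
    return groups


def cohens_d(x, y):
    """Standardized mean difference using the pooled sample variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5


groups = group_ratings(records)
# Per-agent summary: (number of ratings, mean, sample standard deviation).
summary = {a: (len(v), mean(v), stdev(v)) for a, v in groups.items()}
effect = cohens_d(groups["agent_a"], groups["agent_b"])

if __name__ == "__main__":
    for agent, (n, m, sd) in summary.items():
        print(f"{agent}: n={n}, mean={m:.2f}, sd={sd:.2f}")
    print(f"Cohen's d (agent_a vs agent_b): {effect:.2f}")
```

With real data, the choice of test would depend on the rating scale and on whether each participant saw one agent or both; the source states only that each user interacts with one of two systems, which is why the sketch treats the groups as independent.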

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as a benchmark for testing whether new behavior change agents reproduce the same user language patterns observed here.
  • Researchers might extend the work by running parallel studies with human counselors to isolate differences unique to LLM agents.
  • The feedback data might reveal prompt-level adjustments that reduce user frustration without requiring full retraining of the model.

Load-bearing premise

The recorded interactions are representative enough of everyday behavior change conversations that patterns found in them will apply to the design of future systems.

What would settle it

The dataset's claimed utility would be falsified by a controlled test in which behavior change systems built with guidance from it show no measurable improvement in user retention or outcome metrics over systems built without it.

Figures

Figures reproduced from arXiv: 2401.16167 by David Elsweiler, Selina Meyer.

Figure 1: Each user interacts with one of two systems, where one system is prompted to adhere to Motivational …
Original abstract

Conversational agents are increasingly used to address emotional needs on top of information needs. One use case of increasing interest are counselling-style mental health and behaviour change interventions, with large language model (LLM)-based approaches becoming more popular. Research in this context so far has been largely system-focused, foregoing the aspect of user behaviour and the impact this can have on LLM-generated texts. To address this issue, we share a dataset containing text-based user interactions related to behaviour change with two GPT-4-based conversational agents collected in a preregistered user study. This dataset includes conversation data, user language analysis, perception measures, and user feedback for LLM-generated turns, and can offer valuable insights to inform the design of such systems based on real interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a dataset of text-based user interactions with two GPT-4-based conversational agents for behaviour change support, collected via a preregistered user study. It includes conversation logs, user language analysis, perception measures, and feedback on LLM-generated turns, positioned as a resource to inform the design of future LLM-based behaviour change systems.

Significance. The dataset release addresses a noted gap between system-focused LLM research and user behaviour in counselling-style interventions. The preregistered study design is a clear strength that enhances credibility. Real-user data of this form can yield design-relevant observations even from a narrow sample, provided the collection process is fully documented.

major comments (1)
  1. [Abstract] The statement that a preregistered user study was conducted supplies no sample size, inclusion criteria, agent prompts, or validation steps. These details are load-bearing for any claim that the dataset can supply useful signals for system design, as they determine the scope and reliability of the collected interactions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and constructive comment. We address the point raised regarding the abstract below.

Point-by-point responses
  1. Referee: [Abstract] The statement that a preregistered user study was conducted supplies no sample size, inclusion criteria, agent prompts, or validation steps. These details are load-bearing for any claim that the dataset can supply useful signals for system design, as they determine the scope and reliability of the collected interactions.

    Authors: We agree that the abstract would be strengthened by briefly indicating key study parameters to help readers evaluate the dataset's scope. The full manuscript already details the preregistered protocol, sample size, inclusion criteria, agent prompts, and validation steps in the Methods section. In the revised version we will add a concise clause to the abstract (e.g., noting the sample size and that prompts and validation are described in the paper) while remaining within typical length limits.

    revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or fitted claims

full rationale

The paper releases conversation logs, language analyses, perception measures, and feedback from a preregistered GPT-4 user study. No equations, parameters, predictions, or derivations appear in the abstract or described content. The central claim—that the dataset can supply design-relevant observations—does not reduce to any self-referential construction, self-citation chain, or fitted input renamed as output. This is a standard data contribution whose value is independent of statistical representativeness or modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on the premise that a preregistered user study with GPT-4 agents produced usable interaction data for system design; no free parameters, mathematical axioms, or new invented entities are introduced.

pith-pipeline@v0.9.0 · 5651 in / 1129 out tokens · 25175 ms · 2026-05-24T04:49:26.546710+00:00 · methodology

