PSY-STEP: Structuring Therapeutic Targets and Action Sequences for Proactive Counseling Dialogue Systems

Gary Geunbae Lee; Hyounghun Kim; Jihyun Lee; SungJun Yang; Yejin Jeon; Yejin Min

arxiv: 2604.04448 · v1 · submitted 2026-04-06 · 💻 cs.AI

Jihyun Lee, Yejin Min, Yejin Jeon, SungJun Yang, Hyounghun Kim, Gary Geunbae Lee

Pith reviewed 2026-05-10 20:18 UTC · model grok-4.3

classification 💻 cs.AI

keywords CBT counseling · dialogue systems · automatic thoughts · proactive agents · preference learning · therapeutic sequences · AI counseling agents

The pith

Modeling automatic negative thoughts within dynamic counseling sequences allows AI agents to conduct proactive and clinically grounded CBT dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cognitive Behavioral Therapy depends on spotting and reframing automatic negative thoughts, yet dialogue agents typically miss them during live exchanges. The paper builds the STEP dataset to record these thoughts explicitly alongside sequences of therapeutic actions. STEPPER is then trained on the dataset to draw out the thoughts proactively and carry out matching interventions. Preference learning from simulated sessions sharpens both accuracy and empathy. Evaluations find the result more clinically aligned, coherent, and personalized than baselines, with stronger counselor competence and no added emotional disruption.

Core claim

The paper claims that encoding automatic thoughts and action-level counseling sequences in the STEP dataset, then training STEPPER with preference learning on simulated sessions, produces a dialogue agent that proactively elicits thoughts and delivers cognitively grounded interventions, outperforming baselines on clinical grounding, coherence, personalization, and competence without increasing emotional disruption.

What carries the argument

The STEP dataset, which pairs automatic thoughts with dynamic sequences of counseling actions, enabling the STEPPER agent to learn proactive elicitation and intervention.
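As a rough illustration of what pairing an automatic thought with an action sequence could look like as a data structure, here is a minimal Python sketch. The class names, field names, and action labels (`StepRecord`, `CounselingAction`, `elicit_automatic_thought`, `examine_evidence`) are assumptions made for this example and are not taken from the paper or its released data. The sample client text is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CounselingAction:
    """One action-level step in a counseling sequence (hypothetical schema)."""
    name: str       # illustrative action label, not the paper's taxonomy
    utterance: str  # counselor turn that realizes the action


@dataclass
class StepRecord:
    """Hypothetical record pairing an automatic thought with an action sequence."""
    surface_problem: str    # what the client reports on the surface
    automatic_thought: str  # the involuntary interpretation behind it
    actions: List[CounselingAction] = field(default_factory=list)

    def action_names(self) -> List[str]:
        """Return the ordered action labels for this record."""
        return [a.name for a in self.actions]


# Invented example content, for illustration only.
record = StepRecord(
    surface_problem="I keep putting off my thesis.",
    automatic_thought="If the draft is not perfect, my advisor will think I am incompetent.",
    actions=[
        CounselingAction("elicit_automatic_thought",
                         "What went through your mind when you opened the draft?"),
        CounselingAction("examine_evidence",
                         "What evidence supports or contradicts that thought?"),
    ],
)
print(record.action_names())
```

Keeping the thought and the ordered actions in one record is what would let a model condition each intervention on the thought it targets; the actual STEP format may differ.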

If this is right

Counseling agents shift from reactive to proactive identification of cognitive distortions in ongoing dialogue.
Preference optimization on simulations raises both decision accuracy and empathic quality simultaneously.
Counseling outputs gain personalization and coherence while remaining clinically grounded.
Competence increases without raising measured emotional disruption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If simulation quality improves further, training data needs for therapeutic agents could drop sharply.
The same structure of thoughts plus action sequences could be adapted to other therapy schools beyond CBT.
Deployment testing on varied client groups would be required to check whether gains hold outside the simulated distribution.

Load-bearing premise

Simulated and synthesized counseling sessions used for preference learning accurately capture real human responses and therapeutic dynamics.
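To make the dependence on simulated sessions concrete, the sketch below shows one common preference-learning objective, Direct Preference Optimization (DPO). This is an assumption for illustration: the abstract says only "preference learning" and does not name the objective STEPPER uses. The function name, argument names, and the default `beta` are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for a single (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. Illustrative only; the paper's
    actual objective is not stated in the abstract.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With no preference signal the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the chosen response relative to the reference lowers the loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Whatever the objective, the "chosen" and "rejected" labels come from the simulated sessions, so any systematic bias in the simulator is optimized directly into the policy.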

What would settle it

A trial in which STEPPER conducts sessions with actual clients and is scored against human counselors on standardized CBT fidelity and outcome scales.

Figures

Figures reproduced from arXiv: 2604.04448 by Gary Geunbae Lee, Hyounghun Kim, Jihyun Lee, SungJun Yang, Yejin Jeon, Yejin Min.

**Figure 2.** Overview of the PSY-STEP dataset construction and structured CBT counseling flow. The figure illustrates how client profiles are modeled, how surface-level problems and automatic thoughts are elicited during the diagnostic stage, and how structured action sequences guide therapeutic interventions through stepwise CBT reasoning.

**Figure 3.** Illustration of the simulation-based process

**Figure 4.** Preference comparisons of STEPPER, conducted with Gemini-based clients and evaluators.

**Figure 6.** Correlation between overall human preference

Original abstract

Cognitive Behavioral Therapy (CBT) aims to identify and restructure automatic negative thoughts pertaining to involuntary interpretations of events, yet existing counseling agents struggle to identify and address them in dialogue settings. To bridge this gap, we introduce STEP, a dataset that models CBT counseling by explicitly reflecting automatic thoughts alongside dynamic, action-level counseling sequences. Using this dataset, we train STEPPER, a counseling agent that proactively elicits automatic thoughts and executes cognitively grounded interventions. To further enhance both decision accuracy and empathic responsiveness, we refine STEPPER through preference learning based on simulated, synthesized counseling sessions. Extensive CBT-aligned evaluations show that STEPPER delivers more clinically grounded, coherent, and personalized counseling compared to other strong baseline models, and achieves higher counselor competence without inducing emotional disruption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a CBT-focused dataset and proactive agent but its claims rest on thin evaluation details and unvalidated synthetic sessions.

The main takeaway is that this work introduces STEP, a dataset that pairs automatic thoughts with counseling action sequences, and STEPPER, an agent trained to elicit those thoughts and run cognitively grounded interventions, then tuned via preference learning on simulated dialogues. That is the concrete addition beyond generic counseling chatbots. The framing around why existing agents miss involuntary negative interpretations is straightforward, and the proactive angle fits CBT logic without overclaiming. The dataset itself could be a usable resource for others if released with clear documentation.

The soft spots sit in the evidence. The abstract says extensive CBT-aligned evaluations show gains in clinical grounding, coherence, personalization, and counselor competence without emotional disruption, yet it supplies no metrics, no baseline comparisons, no statistical tests, and no controls. That leaves the size of any improvement impossible to judge from the given text. The bigger issue is the preference learning step, which depends on simulated and synthesized sessions. Nothing in the abstract shows that the synthesis process reproduces real client ambivalence, resistance, or emotional escalation patterns, so the learned policy could be fitting artifacts rather than actual therapeutic dynamics. If the synthetic data is too cooperative or simplified, generalization to real users stays unproven.

This is the kind of paper that belongs in the AI-for-mental-health corner of dialogue systems research. Readers building specialized agents or datasets might extract the STEP structure or the proactive elicitation idea, but anyone needing reliable performance numbers will have to wait for the full results. It is worth sending to peer review because the core idea is specific and the dataset contribution is distinct enough to check, though the referee will almost certainly ask for the missing evaluation details and a validation study on the synthetic data.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the STEP dataset for modeling CBT counseling dialogues by explicitly capturing automatic thoughts alongside dynamic action-level counseling sequences. It proposes the STEPPER agent, trained on this dataset to proactively elicit automatic thoughts and execute cognitively grounded interventions, which is further refined via preference learning over simulated and synthesized counseling sessions. The central claim is that extensive CBT-aligned evaluations demonstrate STEPPER outperforms strong baseline models in clinical grounding, coherence, personalization, and counselor competence without inducing emotional disruption.

Significance. If the evaluations are robust and the synthetic data faithfully represents therapeutic dynamics, the work could meaningfully advance proactive counseling dialogue systems by embedding explicit CBT structures for automatic thoughts, offering a scalable path to more competent and safe AI-assisted therapy agents through preference optimization.

major comments (2)

[Abstract] The claim that 'extensive CBT-aligned evaluations show that STEPPER delivers more clinically grounded, coherent, and personalized counseling' provides no information on the concrete metrics (e.g., competence scores, coherence ratings), baseline models, statistical tests, or controls employed. This absence directly undermines verification of the headline performance claims.
The preference-learning stage (described after dataset introduction): Optimization is performed exclusively on 'simulated, synthesized counseling sessions,' yet no evidence or validation is supplied that these LLM-generated sessions reproduce key real-therapy distributions such as client ambivalence, resistance, or emotional escalation. If the synthetic data systematically deviates, the learned policy may optimize for artifacts rather than genuine therapeutic efficacy, rendering generalization claims unsupported.
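The validation requested in the second comment could take many forms. One simple option, sketched here under stated assumptions, is to annotate client turns in synthetic and real sessions with behavior labels and compare the two label distributions with the Jensen-Shannon divergence. The category names and the toy label lists are invented for illustration; no such annotation is described in the abstract.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Relative frequency of each category in a list of turn-level labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical label set and toy annotations (not real data).
CATEGORIES = ["cooperative", "ambivalent", "resistant", "escalating"]
synthetic = ["cooperative"] * 8 + ["ambivalent"] * 2
real = ["cooperative"] * 4 + ["ambivalent"] * 3 + ["resistant"] * 2 + ["escalating"]

p = label_distribution(synthetic, CATEGORIES)
q = label_distribution(real, CATEGORIES)
print(js_divergence(p, q))
```

A value near zero would suggest the simulator covers client behaviors at realistic rates; a large value would support the referee's concern that the learned policy is tuned to overly cooperative clients.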

minor comments (1)

[Abstract] The title uses 'PSY-STEP' while the abstract and body refer to 'STEP' and 'STEPPER'; a brief clarification of the naming convention would reduce potential reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and indicate the revisions made to the manuscript.

Referee: [Abstract] The claim that 'extensive CBT-aligned evaluations show that STEPPER delivers more clinically grounded, coherent, and personalized counseling' provides no information on the concrete metrics (e.g., competence scores, coherence ratings), baseline models, statistical tests, or controls employed. This absence directly undermines verification of the headline performance claims.

Authors: We agree with this observation. The original abstract was intentionally concise, but we recognize that it lacks sufficient detail for readers to assess the claims. In the revised version, we have expanded the abstract to include specific evaluation metrics such as clinical grounding scores, coherence ratings, and counselor competence measures, along with the baseline models used and references to the statistical analyses performed. These elements are detailed in the Experiments and Evaluation sections of the manuscript.

Revision: yes

Referee: The preference-learning stage (described after dataset introduction): Optimization is performed exclusively on 'simulated, synthesized counseling sessions,' yet no evidence or validation is supplied that these LLM-generated sessions reproduce key real-therapy distributions such as client ambivalence, resistance, or emotional escalation. If the synthetic data systematically deviates, the learned policy may optimize for artifacts rather than genuine therapeutic efficacy, rendering generalization claims unsupported.

Authors: This is a valid concern regarding the fidelity of our synthetic data. The synthesized sessions were generated using prompts informed by the STEP dataset and CBT principles to incorporate elements like client ambivalence and resistance. However, we did not perform a quantitative validation comparing the distributions of these sessions to real therapy data for aspects such as emotional escalation. We have revised the manuscript to include an explicit Limitations section that acknowledges this gap, discusses the potential implications for generalization, and suggests future directions involving real client data for validation. Our current evaluations, which include checks for emotional disruption, provide supporting evidence for the safety and grounding of the resulting policy.

Revision: partial
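The first response refers to statistical analyses without naming them. For pairwise preference comparisons of the kind shown in Figure 4, one standard choice is an exact two-sided sign test on win and loss counts with ties excluded. The sketch below is a generic illustration of that test, not the analysis the authors ran, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact binomial sign test under H0: P(win) = 0.5.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test_p_value(10, 0))  # 0.001953125
print(sign_test_p_value(5, 5))   # 1.0
```

Reporting such a p-value next to each win rate would let readers judge whether a preference margin is distinguishable from chance at the given sample size.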

Circularity Check

0 steps flagged

No significant circularity; new dataset and standard preference learning

The paper introduces a new STEP dataset explicitly modeling CBT elements (automatic thoughts + action sequences), trains STEPPER on it, and applies standard preference optimization over simulated sessions. All claims rest on empirical comparisons to baselines using CBT-aligned metrics. No derivation reduces by construction to fitted inputs, self-definitions, or self-citation chains; the central results are obtained from independent training and evaluation steps rather than tautological mappings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on abstract; central claim rests on domain assumptions about CBT structure and simulated data fidelity rather than new axioms or entities. No free parameters or invented entities are specified.

axioms (1)

domain assumption: Simulated, synthesized counseling sessions can serve as valid proxies for real therapeutic interactions in preference learning.
Invoked to refine STEPPER via preference learning on simulated sessions.
