
arxiv: 2605.07011 · v1 · submitted 2026-05-07 · 💻 cs.LG

Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization

Pith reviewed 2026-05-11 01:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords dual-agent co-training · health coaching · motivational interviewing · direct preference optimization · adversarial training · client simulator · stochastic game

The pith

Dual-agent co-training with implicit adversarial preference optimization improves AI health coach performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training an AI health coach and a simulated client together in a continuous interactive loop instead of fixing one side. The coach is updated through direct preference optimization on response pairs that a multi-dimensional LLM judge ranks as Pareto-dominant improvements. The client simulator is then updated by reversing those same preferences, which creates an ongoing adversarial pressure. A reader would care because one-sided training often limits how far either agent can develop, while this mutual push could produce coaches that handle real motivational interviewing conversations more effectively. The setup is also shown to correspond to a stochastic game between the two agents.

Core claim

We propose a dual-agent framework that interactively co-trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto-dominant response pairs identified by a multi-dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co-training process admits a natural stochastic-game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.
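
For reference, the generic DPO objective can be written out for the coach policy π_C named in Figure 1. This is the standard form from the DPO literature, not an equation reproduced from the paper: the temperature β, the reference policies, and the pair sets D_C and D_U are assumed notation. The second line is one possible reading of "reversing these preferences", in which the client policy π_U is trained with the same loss on pairs whose winner and loser have been swapped.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_C)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_C}
    \left[\log \sigma\!\left(
      \beta \log\frac{\pi_C(y_w \mid x)}{\pi_C^{\mathrm{ref}}(y_w \mid x)}
      - \beta \log\frac{\pi_C(y_l \mid x)}{\pi_C^{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

\mathcal{L}_{\mathrm{DPO}}(\pi_U)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_U}
    \left[\log \sigma\!\left(
      \beta \log\frac{\pi_U(y_w \mid x)}{\pi_U^{\mathrm{ref}}(y_w \mid x)}
      - \beta \log\frac{\pi_U(y_l \mid x)}{\pi_U^{\mathrm{ref}}(y_l \mid x)}
    \right)\right],
\qquad
\mathcal{D}_U = \{(x,\,y_l,\,y_w)\ :\ \text{the judge ranks } y_w \text{ above } y_l\}
```

Here σ is the logistic function, x is the dialogue context, and y_w and y_l are the preferred and dispreferred continuations. How the paper constructs client-side pairs in detail is not stated in the text available here.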

What carries the argument

Dual-agent co-training loop in which an LLM judge selects Pareto-dominant response pairs for coach DPO updates while the client simulator is trained on the reversed preferences to induce adversarial improvement.
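
A minimal Python sketch of the selection step may make this concrete. It assumes the judge returns one score per dimension on the 1–5 scale mentioned in Figure 1, and it uses the usual definition of Pareto dominance: at least as good on every dimension and strictly better on at least one. The function names, the candidate labels, and the example scores are hypothetical; the three dimensions only echo the CCT, SST, and Empathy labels from Figure 3, and the paper's actual selection rule may differ.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one judge score per dimension, e.g. (CCT, SST, Empathy)


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is >= b on every dimension and strictly > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_pairs(candidates: Dict[str, Scores]) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs where chosen Pareto-dominates rejected."""
    return [
        (winner, loser)
        for winner, loser in permutations(candidates, 2)
        if dominates(candidates[winner], candidates[loser])
    ]


def reverse_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Swap chosen and rejected (assumed form of the adversarial client signal)."""
    return [(loser, winner) for winner, loser in pairs]


# Hypothetical judge scores for three candidate coach utterances.
scores = {
    "reflective": (4, 4, 5),
    "directive": (2, 3, 3),
    "mixed": (5, 2, 4),
}

coach_pairs = pareto_pairs(scores)         # pairs for the coach's DPO update
client_pairs = reverse_pairs(coach_pairs)  # same pairs with roles swapped
print(coach_pairs)
print(client_pairs)
```

Note that "mixed" forms no pair with either other candidate: it trades a higher first score for a lower second one, so neither side dominates. Requiring dominance therefore discards pairs that would ask the coach to improve one dimension at the expense of another.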

If this is right

  • The coach agent develops stronger responses by learning directly from the evolving client simulator.
  • The client simulator generates increasingly challenging interactions that expand the coach's explored behavior space.
  • Coaching performance advances simultaneously on multiple dimensions rather than trading off one for another.
  • The overall training process corresponds to a stochastic game whose equilibrium can be approximated through the alternating updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-training pattern could be applied to other open-ended dialogue tasks such as tutoring or customer support where both sides of the conversation benefit from mutual adaptation.
  • Testing the final coach against held-out human users would reveal whether the LLM judge's Pareto preferences align with actual human outcomes.
  • The adversarial client could serve as a built-in stress tester for measuring coach robustness before real-world deployment.

Load-bearing premise

The multi-dimensional LLM judge can reliably identify Pareto-dominant pairs that reflect genuine gains in coaching effectiveness rather than judge-specific artifacts.

What would settle it

A controlled comparison in which the dual-trained coach interacts with real human clients and shows no measurable gain in client-reported motivation, behavior change, or session quality versus a coach trained only against a fixed simulator.

Figures

Figures reproduced from arXiv: 2605.07011 by Da Long, Diya Michelle Rao, Jasmine Ruales Carrera, Lingyi Fu, Shandian Zhe, Yang Bai.

Figure 1: Overview of DACT. Left: at each co-evolution round, the coach adapter π_C and client adapter π_U engage in multi-turn dialogue. Middle: the coach branches into candidate utterances {c_t^(i)}; each branch is followed by candidate client responses {u_{t+1}^(ij)}. A frozen LLM judge scores every coach node on a 1–5 scale, and the resulting scores are backed up into per-node Q-vectors Q(c) and Q(u). Right: the co…
Figure 2: Performance trajectory over co-evolution iterations, evaluated against fixed π_U^8 on the same 20 held-out personas. We denote training iteration or round k by Rk. Left: mean3. DACT improves near-monotonically from 2.80 at R1 to 4.25 at R12 (+1.45, +51.8%); Client-Frozen plateaus near 3.10 from R5 onward; SFT remains at 2.57. Right: sentence-level anti% on a logarithmic axis. DACT drops from 13.73% at R1 t…
Figure 3: Per-dimension trajectory of conditions DACT, Client-Frozen, and SFT across K=13 co-evolution iterations. Left (CCT): DACT stays within ±0.3 of Client-Frozen through R8, then jumps from 2.66 at R9 to 3.18 at R10 and reaches 3.84 at R12. Middle (SST): DACT rises substantially from 2.29 at R1 to 4.43 at R12 while Client-Frozen plateaus near 2.7. Right (Empathy): DACT and Client-Frozen are nearly matched throu…
Abstract

Motivational-interviewing-based health coaching is an effective approach for improving mental health and promoting healthy behavior change. However, the scarcity of trained human coaches and the high cost of coaching services make such support inaccessible to many people who could benefit from it. This motivates the development of AI health coaches that can provide scalable and affordable support. Existing methods typically optimize only one side of the interaction: they either train a dialogue agent against a fixed client environment or train a client simulator against a fixed assistant. This one-sided setup can limit exploration of the interaction space and may be inefficient at developing the capabilities required by the target agent and pushing its performance boundaries. In this paper, we propose a dual-agent framework that interactively co-trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto-dominant response pairs identified by a multi-dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co-training process admits a natural stochastic-game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a dual-agent co-training framework for motivational-interviewing health coaching. A coach agent is optimized via DPO on Pareto-dominant response pairs selected by a multi-dimensional LLM judge; the client simulator is trained adversarially by reversing those preferences. The setup is interpreted as a stochastic game, and the authors claim that extensive experiments show improvements in coaching quality across multiple dimensions.

Significance. If the empirical results hold under rigorous validation, the dual-agent adversarial co-training loop could provide a principled way to expand the interaction space beyond one-sided training regimes, with potential applicability to other dialogue-based health or behavioral interventions. The stochastic-game framing offers a formal lens that may generalize, though its utility depends on whether the judge-induced preferences track genuine coaching effectiveness.

major comments (3)
  1. [Experiments] Experiments section: the abstract and high-level description assert that 'extensive experiments demonstrate' improvements across dimensions, yet no baselines, concrete metrics (e.g., empathy scores, behavior-change success rates), dataset statistics, statistical significance tests, or ablation results are supplied in the provided text. This absence leaves the central empirical claim without verifiable support and must be addressed before the performance gains can be assessed.
  2. [Method] Section on the multi-dimensional LLM judge and Pareto-dominant pair selection: the method relies on the judge reliably identifying pairs that correspond to genuine motivational-interviewing gains rather than LLM-specific artifacts or dimension-wise inconsistencies. No validation (human ratings, inter-judge agreement, or correlation with external coaching-effectiveness measures) is described, even though this reliability is load-bearing for both the DPO updates and the reversed-preference adversarial signal.
  3. [Theoretical Analysis] Stochastic-game interpretation: while the co-training loop is presented as admitting a natural game-theoretic view, no explicit payoff functions, equilibrium analysis, or convergence arguments are given that would distinguish the claimed implicit adversarial dynamic from standard preference-reversal training. If the interpretation is only descriptive, its contribution to the central claim should be clarified.
minor comments (2)
  1. [Method] Notation for the multi-dimensional judge and Pareto dominance should be defined formally (e.g., via explicit scoring functions or dominance criteria) rather than left at the level of a high-level sketch.
  2. [Introduction] The abstract and introduction would benefit from a short related-work paragraph contrasting the proposed dual-agent loop with prior one-sided coach or client training methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improvement. We agree that the manuscript requires major revisions to strengthen the empirical support, validate the judge, and clarify the theoretical framing. We address each major comment below and commit to incorporating the necessary changes.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract and high-level description assert that 'extensive experiments demonstrate' improvements across dimensions, yet no baselines, concrete metrics (e.g., empathy scores, behavior-change success rates), dataset statistics, statistical significance tests, or ablation results are supplied in the provided text. This absence leaves the central empirical claim without verifiable support and must be addressed before the performance gains can be assessed.

    Authors: We acknowledge that the current manuscript presentation omits key experimental details, which prevents full verification of the claims. In the revised version, we will substantially expand the Experiments section to include: dataset statistics and collection details; concrete metrics with definitions (including empathy scores, behavior-change success rates, and other dimensions); multiple baselines with direct comparisons; statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals with p-values); and ablation studies isolating the contributions of Pareto selection, adversarial reversal, and co-training. Tables and figures will be added to present these results transparently.
    revision: yes

  2. Referee: [Method] § on the multi-dimensional LLM judge and Pareto-dominant pair selection: the method relies on the judge reliably identifying pairs that correspond to genuine motivational-interviewing gains rather than LLM-specific artifacts or dimension-wise inconsistencies. No validation (human ratings, inter-judge agreement, or correlation with external coaching-effectiveness measures) is described, which is load-bearing for both the DPO updates and the reversed-preference adversarial signal.

    Authors: The referee is correct that the absence of validation for the LLM judge is a significant gap, as its reliability underpins the preference pairs used for both DPO and adversarial training. We will add a dedicated validation subsection in the revised manuscript. This will report: (i) human ratings on a sampled subset of pairs and their correlation with LLM judgments; (ii) inter-judge agreement metrics (e.g., Fleiss' kappa across multiple LLM judges and human-LLM agreement); and (iii) alignment of the chosen dimensions with established motivational-interviewing principles. We will also discuss potential artifacts and how the Pareto selection mitigates dimension-wise inconsistencies.
    revision: yes

  3. Referee: [Theoretical Analysis] Stochastic-game interpretation: while the co-training loop is presented as admitting a natural game-theoretic view, no explicit payoff functions, equilibrium analysis, or convergence arguments are given that would distinguish the claimed implicit adversarial dynamic from standard preference-reversal training. If the interpretation is only descriptive, its contribution to the central claim should be clarified.

    Authors: We agree that the stochastic-game view is currently descriptive rather than providing a full formal analysis. In the revision, we will explicitly state this limitation and clarify its role as an interpretive lens that motivates the implicit adversarial dynamic arising from preference reversal in the dual-agent loop. We will add a short discussion distinguishing it from standard one-sided preference reversal by emphasizing the interactive co-training aspect. No new equilibrium proofs or convergence arguments will be claimed unless supported by additional analysis; the section will be reframed to avoid overstating its theoretical contribution.
    revision: partial
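
As an illustration of the significance testing promised in response 1, the following is a minimal sketch of a paired percentile-bootstrap confidence interval for the mean score difference between two coaches evaluated on the same personas. It is not the authors' procedure: the function name, the resample count, the seed, and the example scores are all hypothetical, and only the Python standard library is used.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    a: List[float],
    b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for paired scores a - b."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    # Work on per-persona differences so the pairing is preserved.
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = boot_means[int((alpha / 2) * n_resamples)]
    upper = boot_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


# Hypothetical per-persona judge scores for two coaches on the same six personas.
coach_a = [4.0, 4.5, 3.5, 4.0, 5.0, 4.5]
coach_b = [3.0, 3.5, 3.0, 2.5, 4.0, 3.5]

point, lower, upper = paired_bootstrap_ci(coach_a, coach_b)
print(f"mean difference = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero would indicate that the gap is unlikely to be a resampling artifact. With only a handful of personas the interval is coarse, so a real evaluation would want more paired observations than this toy example uses.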

Circularity Check

0 steps flagged

No equations or derivations present; method is a high-level algorithmic sketch with no self-referential reductions

The paper describes a dual-agent co-training setup where a coach agent is updated via DPO on Pareto pairs from an LLM judge and the client is updated by preference reversal, plus a stochastic-game interpretation. No mathematical derivations, equations, fitted parameters, or self-citations appear in the abstract or method sketch. Without any formal chain that could reduce a claimed result to its own inputs by construction, no circularity patterns (self-definitional, fitted-input-as-prediction, etc.) can be exhibited. The framework is presented as an empirical proposal whose validity rests on experiments rather than closed-form identities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review provides no implementation details, so the ledger is necessarily incomplete; several domain assumptions and one new method component are inferred from the high-level description.

axioms (2)
  • [domain assumption] DPO can be directly applied to optimize a dialogue coach agent from LLM-generated preference pairs
    Invoked when the coach is optimized with DPO
  • [ad hoc to paper] Reversing the identified preferences produces an effective adversarial training signal for the client simulator
    Core mechanism for the implicit adversarial dynamic
invented entities (2)
  • Multi-dimensional LLM judge for Pareto-dominant pairs (no independent evidence)
    purpose: To generate training preference signals for the coach
    Introduced as the source of Pareto-dominant pairs; no independent evidence of reliability provided
  • Dual-agent co-training loop (no independent evidence)
    purpose: To enable interactive improvement of both coach and client simulator
    The central proposed framework

pith-pipeline@v0.9.0 · 5513 in / 1582 out tokens · 49569 ms · 2026-05-11T01:16:12.172437+00:00 · methodology
