Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization
Pith reviewed 2026-05-11 01:16 UTC · model grok-4.3
The pith
Dual-agent co-training with implicit adversarial preference optimization improves AI health coach performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a dual-agent framework that interactively co-trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto-dominant response pairs identified by a multi-dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co-training process admits a natural stochastic-game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.
What carries the argument
Dual-agent co-training loop in which an LLM judge selects Pareto-dominant response pairs for coach DPO updates while the client simulator is trained on the reversed preferences to induce adversarial improvement.
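The pair-selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the three-dimensional score tuples, and the candidate labels are hypothetical, scores are assumed to be "higher is better", and the `reverse` flag simply swaps chosen and rejected as a placeholder for however the paper maps reversed preferences onto client turns.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if score vector a Pareto-dominates b (higher is better on every dimension)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_pairs(scores: Dict[str, Sequence[float]]) -> List[Tuple[str, str]]:
    """Return (winner, loser) pairs in which the winner Pareto-dominates the loser."""
    return [(w, l) for w, l in permutations(scores, 2) if dominates(scores[w], scores[l])]


def to_preference_records(prompt: str, pairs: List[Tuple[str, str]], reverse: bool = False) -> List[dict]:
    """Build DPO-style records; reverse=True swaps chosen and rejected (assumed client-side signal)."""
    records = []
    for winner, loser in pairs:
        chosen, rejected = (loser, winner) if reverse else (winner, loser)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records


# Hypothetical judge scores on three dimensions for three candidate responses.
scores = {"A": (4, 5, 4), "B": (3, 5, 4), "C": (5, 2, 4)}
pairs = pareto_pairs(scores)
coach_records = to_preference_records("client turn", pairs)
client_records = to_preference_records("client turn", pairs, reverse=True)
```

In this toy input only A dominates B; C is better than A on one dimension and worse on another, so it forms no pair. Discarding such mixed comparisons is what keeps the training signal from trading one dimension off against another.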
If this is right
- The coach agent develops stronger responses by learning directly from the evolving client simulator.
- The client simulator generates increasingly challenging interactions that expand the coach's explored behavior space.
- Coaching performance advances simultaneously on multiple dimensions rather than trading off one for another.
- The overall training process corresponds to a stochastic game whose equilibrium can be approximated through the alternating updates.
Where Pith is reading between the lines
- The same co-training pattern could be applied to other open-ended dialogue tasks such as tutoring or customer support where both sides of the conversation benefit from mutual adaptation.
- Testing the final coach against held-out human users would reveal whether the LLM judge's Pareto preferences align with actual human outcomes.
- The adversarial client could serve as a built-in stress tester for measuring coach robustness before real-world deployment.
Load-bearing premise
The multi-dimensional LLM judge can reliably identify Pareto-dominant pairs that reflect genuine gains in coaching effectiveness rather than judge-specific artifacts.
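One way to probe this premise is to compare the judge's pairwise preferences with human labels on the same pairs. The sketch below computes Cohen's kappa for two raters; the label values and the example lists are hypothetical, and the paper does not report such a check in the text reviewed here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: which response of each pair the rater preferred.
judge_labels = ["first", "first", "second", "second"]
human_labels = ["first", "second", "second", "second"]
kappa = cohens_kappa(judge_labels, human_labels)  # 0.5 for these lists
```

A kappa near zero would suggest the judge's preferences match human ones no better than chance, which would undercut both the coach updates and the reversed client signal.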
What would settle it
A controlled comparison in which the dual-trained coach interacts with real human clients and shows no measurable gain in client-reported motivation, behavior change, or session quality versus a coach trained only against a fixed simulator.
Original abstract
Motivational-interviewing-based health coaching is an effective approach for improving mental health and promoting healthy behavior change. However, the scarcity of trained human coaches and the high cost of coaching services make such support inaccessible to many people who could benefit from it. This motivates the development of AI health coaches that can provide scalable and affordable support. Existing methods typically optimize only one side of the interaction: they either train a dialogue agent against a fixed client environment or train a client simulator against a fixed assistant. This one-sided setup can limit exploration of the interaction space and may be inefficient at developing the capabilities required by the target agent and pushing its performance boundaries. In this paper, we propose a dual-agent framework that interactively co-trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto-dominant response pairs identified by a multi-dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co-training process admits a natural stochastic-game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.
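For readers unfamiliar with DPO, the per-pair objective the abstract refers to can be sketched as below. This is the standard DPO loss, not code from the paper; the argument names and the default `beta=0.1` are assumptions, and the inputs are taken to be summed log-probabilities of each full response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the paper's setting is not given here
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

Under this objective, swapping which response is labeled chosen flips the sign of the margin, which is why reversed preference pairs push a model in the opposite direction.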
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dual-agent co-training framework for motivational-interviewing health coaching. A coach agent is optimized via DPO on Pareto-dominant response pairs selected by a multi-dimensional LLM judge; the client simulator is trained adversarially by reversing those preferences. The setup is interpreted as a stochastic game, and the authors claim that extensive experiments show improvements in coaching quality across multiple dimensions.
Significance. If the empirical results hold under rigorous validation, the dual-agent adversarial co-training loop could provide a principled way to expand the interaction space beyond one-sided training regimes, with potential applicability to other dialogue-based health or behavioral interventions. The stochastic-game framing offers a formal lens that may generalize, though its utility depends on whether the judge-induced preferences track genuine coaching effectiveness.
major comments (3)
- [Experiments] Experiments section: the abstract and high-level description assert that 'extensive experiments demonstrate' improvements across dimensions, yet no baselines, concrete metrics (e.g., empathy scores, behavior-change success rates), dataset statistics, statistical significance tests, or ablation results are supplied in the provided text. This absence leaves the central empirical claim without verifiable support and must be addressed before the performance gains can be assessed.
- [Method] § on the multi-dimensional LLM judge and Pareto-dominant pair selection: the method relies on the judge reliably identifying pairs that correspond to genuine motivational-interviewing gains rather than LLM-specific artifacts or dimension-wise inconsistencies. No validation (human ratings, inter-judge agreement, or correlation with external coaching-effectiveness measures) is described, which is load-bearing for both the DPO updates and the reversed-preference adversarial signal.
- [Theoretical Analysis] Stochastic-game interpretation: while the co-training loop is presented as admitting a natural game-theoretic view, no explicit payoff functions, equilibrium analysis, or convergence arguments are given that would distinguish the claimed implicit adversarial dynamic from standard preference-reversal training. If the interpretation is only descriptive, its contribution to the central claim should be clarified.
minor comments (2)
- [Method] Notation for the multi-dimensional judge and Pareto dominance should be defined formally (e.g., via explicit scoring functions or dominance criteria) rather than left at the level of a high-level sketch.
- [Introduction] The abstract and introduction would benefit from a short related-work paragraph contrasting the proposed dual-agent loop with prior one-sided coach or client training methods.
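As an illustration of the formal definition the first minor comment asks for, one standard way to write Pareto dominance is given below. The notation is assumed rather than taken from the paper: $x$ is the dialogue context, $y$ and $y'$ are two candidate coach responses, and $s_k(x, y)$ is the judge's score on dimension $k$ of $K$, with higher scores better.

```latex
y \succ_{x} y'
\;\iff\;
\Big( \forall k \in \{1, \dots, K\} :\; s_k(x, y) \ge s_k(x, y') \Big)
\;\wedge\;
\Big( \exists k \in \{1, \dots, K\} :\; s_k(x, y) > s_k(x, y') \Big)
```

Under this definition a pair is usable for training only when one response is at least as good on every dimension and strictly better on at least one; pairs with mixed outcomes are incomparable and would be discarded.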
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for improvement. We agree that the manuscript requires major revisions to strengthen the empirical support, validate the judge, and clarify the theoretical framing. We address each major comment below and commit to incorporating the necessary changes.
-
Referee: [Experiments] Experiments section: the abstract and high-level description assert that 'extensive experiments demonstrate' improvements across dimensions, yet no baselines, concrete metrics (e.g., empathy scores, behavior-change success rates), dataset statistics, statistical significance tests, or ablation results are supplied in the provided text. This absence leaves the central empirical claim without verifiable support and must be addressed before the performance gains can be assessed.
Authors: We acknowledge that the current manuscript presentation omits key experimental details, which prevents full verification of the claims. In the revised version, we will substantially expand the Experiments section to include: dataset statistics and collection details; concrete metrics with definitions (including empathy scores, behavior-change success rates, and other dimensions); multiple baselines with direct comparisons; statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals with p-values); and ablation studies isolating the contributions of Pareto selection, adversarial reversal, and co-training. Tables and figures will be added to present these results transparently.
Revision: yes
-
Referee: [Method] § on the multi-dimensional LLM judge and Pareto-dominant pair selection: the method relies on the judge reliably identifying pairs that correspond to genuine motivational-interviewing gains rather than LLM-specific artifacts or dimension-wise inconsistencies. No validation (human ratings, inter-judge agreement, or correlation with external coaching-effectiveness measures) is described, which is load-bearing for both the DPO updates and the reversed-preference adversarial signal.
Authors: The referee is correct that the absence of validation for the LLM judge is a significant gap, as its reliability underpins the preference pairs used for both DPO and adversarial training. We will add a dedicated validation subsection in the revised manuscript. This will report: (i) human ratings on a sampled subset of pairs and their correlation with LLM judgments; (ii) inter-judge agreement metrics (e.g., Fleiss' kappa across multiple LLM judges and human-LLM agreement); and (iii) alignment of the chosen dimensions with established motivational-interviewing principles. We will also discuss potential artifacts and how the Pareto selection mitigates dimension-wise inconsistencies.
Revision: yes
-
Referee: [Theoretical Analysis] Stochastic-game interpretation: while the co-training loop is presented as admitting a natural game-theoretic view, no explicit payoff functions, equilibrium analysis, or convergence arguments are given that would distinguish the claimed implicit adversarial dynamic from standard preference-reversal training. If the interpretation is only descriptive, its contribution to the central claim should be clarified.
Authors: We agree that the stochastic-game view is currently descriptive rather than providing a full formal analysis. In the revision, we will explicitly state this limitation and clarify its role as an interpretive lens that motivates the implicit adversarial dynamic arising from preference reversal in the dual-agent loop. We will add a short discussion distinguishing it from standard one-sided preference reversal by emphasizing the interactive co-training aspect. No new equilibrium proofs or convergence arguments will be claimed unless supported by additional analysis; the section will be reframed to avoid overstating its theoretical contribution.
Revision: partial
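The bootstrap confidence intervals mentioned in the first response could be computed along the following lines. This is a sketch under stated assumptions, not the authors' evaluation code: the function name, the resample count, the seed, and the example scores are all hypothetical, and it assumes each evaluation dialogue yields one score for the co-trained coach and one for a baseline coach on the same client persona.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    treatment: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) via a percentile bootstrap."""
    if len(treatment) != len(baseline) or len(treatment) == 0:
        raise ValueError("inputs must be paired and non-empty")
    rng = random.Random(seed)
    # Work on per-item differences so the pairing is preserved in every resample.
    diffs = [t - b for t, b in zip(treatment, baseline)]
    n = len(diffs)
    resampled_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower_idx = max(0, int((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(diffs), resampled_means[lower_idx], resampled_means[upper_idx]


# Hypothetical per-dialogue scores for illustration only.
co_trained = [4.1, 3.8, 4.5, 4.0, 3.9, 4.4]
fixed_sim = [3.9, 3.7, 4.1, 4.0, 3.6, 4.0]
point, lower, upper = paired_bootstrap_ci(co_trained, fixed_sim)
```

An interval that excludes zero would support a difference between the two coaches on that metric; with several judged dimensions, each would need its own interval and some correction for multiple comparisons.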
Circularity Check
No equations or derivations present; method is a high-level algorithmic sketch with no self-referential reductions
The paper describes a dual-agent co-training setup where a coach agent is updated via DPO on Pareto pairs from an LLM judge and the client is updated by preference reversal, plus a stochastic-game interpretation. No mathematical derivations, equations, fitted parameters, or self-citations appear in the abstract or method sketch. Without any formal chain that could reduce a claimed result to its own inputs by construction, no circularity patterns (self-definitional, fitted-input-as-prediction, etc.) can be exhibited. The framework is presented as an empirical proposal whose validity rests on experiments rather than closed-form identities.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: DPO can be directly applied to optimize a dialogue coach agent from LLM-generated preference pairs
- ad hoc to paper: Reversing the identified preferences produces an effective adversarial training signal for the client simulator
invented entities (2)
- Multi-dimensional LLM judge for Pareto-dominant pairs: no independent evidence
- Dual-agent co-training loop: no independent evidence