Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Chao Luo; Luhui Liu; Xia Zeng; Ye Chen; Yihan Chen; Zhuoran Zhuang

arxiv: 2510.04214 · v3 · submitted 2025-10-05 · 💻 cs.CL

Xia Zeng, Yihan Chen, Luhui Liu, Chao Luo, Ye Chen, Zhuoran Zhuang

Pith reviewed 2026-05-18 10:19 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM alignment, reinforcement learning, persuasive dialogue, reward modeling, policy optimization, dialogue agents, heterogeneous rewards, negotiation systems

The pith

REPO blends preference models, LLM judges, and rules to train LLMs as more persuasive negotiators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops Reward-Enhanced Policy Optimization to train large language models as business development agents that negotiate prices while following strict procedures and guardrails. The method combines a preference-trained reward model, an LLM acting as judge for qualities like emotional value and compliance, and rule-based checks for numbers and formatting. Evaluations show gains in expert ratings for dialogues and a high rate of fixing problematic cases, with further confirmation from live customer tests. A sympathetic reader would care because the work demonstrates a concrete way to align models for complex, multi-turn tasks that require both effectiveness and reliability.

Core claim

Reward-Enhanced Policy Optimization (REPO) is a reinforcement learning post-training approach that optimizes an LLM policy by jointly using a preference-trained reward model, an LLM-as-a-judge for nuanced behaviors such as emotional engagement and SOP compliance, and rule-based reward functions for deterministic checks on numerics and guardrails. Applied to persuasive price negotiation in online travel agencies, REPO raises average dialogue ratings to 4.63, increases the share of conversations with at least one excellent response to 66.67 percent, fixes 93.33 percent of bad cases with 75.56 percent clean fixes, and delivers a 12.14 percentage point lift in response rate plus a 5.94 percentage point lift in task success rate in a production A/B test.

What carries the argument

Reward-Enhanced Policy Optimization (REPO), which merges three heterogeneous reward sources during policy updates to guide alignment in multi-turn persuasive dialogues.

Load-bearing premise

The three reward sources can be combined without creating conflicts or allowing the model to game one source while ignoring the others.

What would settle it

A head-to-head test in which a policy trained on only the strongest single reward source matches or exceeds REPO on both expert ratings and production response rates would undermine the need for the combined approach.

Original abstract

We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF) (mainly regex-based) for deterministic checks on numerics, formatting, and guardrails. In expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes. In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p<0.001).
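
The abstract describes the rule-based reward functions (RF) only as "mainly regex-based" checks on numerics, formatting, and guardrails. A minimal sketch of what such a check could look like is given below. The patterns, the `rule_reward` name, the penalty values, and the `allowed_prices` argument are all hypothetical illustrations and are not the authors' rules.

```python
import re

# Hypothetical guardrail phrases; the paper does not list its actual rules.
OVER_PROMISE = re.compile(r"\b(guarantee[ds]?|definitely|100\s*%)", re.IGNORECASE)
# Matches prices such as "$120" or "$1,250.50" (assumed price format).
PRICE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


def rule_reward(response: str, allowed_prices: set) -> float:
    """Return 0.0 for a clean response and subtract 1.0 per violation."""
    score = 0.0
    # Guardrail: penalize over-promising language.
    if OVER_PROMISE.search(response):
        score -= 1.0
    # Numerics: every quoted price must come from the negotiation context.
    for match in PRICE.finditer(response):
        value = float(match.group(1).replace(",", ""))
        if value not in allowed_prices:
            score -= 1.0
    # Formatting: penalize empty or whitespace-only responses.
    if not response.strip():
        score -= 1.0
    return score
```

Because each check is deterministic, a score from a function like this can be recomputed exactly for any logged response, which is the property that distinguishes RF from the learned RM and RJ signals.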

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows practical gains from mixing preference rewards, LLM judges, and rules in post-training an LLM for constrained OTA price negotiations, but the live A/B test compares REPO against an intent-driven dialogue system rather than the GRPO baseline, so the specific contribution of REPO's reward mix remains unclear.

The core takeaway is that REPO combines a preference reward model, an LLM-as-judge for nuanced traits like emotional tone and SOP adherence, and simple regex rules for guardrails, then applies this mix during policy optimization for multi-turn persuasive dialogues. The authors report better expert ratings than GRPO on a modest evaluation set (30 online conversations and 45 curated bad cases), plus higher response and success rates in a production A/B test with real customers.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training approach that integrates three heterogeneous reward sources—a preference-trained reward model (RM), an LLM-as-judge (RJ) for nuanced behaviors such as emotional value and SOP compliance, and rule-based reward functions (RF) for deterministic checks on numerics, formatting, and guardrails—to align LLMs as persuasive business development agents in online travel agencies. The agents must adhere to a multi-stage Standard Operating Procedure and strict guardrails while maintaining human-like performance in long multi-turn dialogues. The paper reports that REPO achieves an average dialogue rating of 4.63 (+0.33 over GRPO) and 66.67% of conversations with at least one excellent response (+23.34 pp over GRPO) in expert consensus evaluation on 30 online conversations and 45 curated bad cases, along with a 93.33% bad-case fix rate. In a production A/B test on 9,653 real customer conversations, REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p<0.001) relative to an intent-driven dialogue system.
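
The reported percentages are consistent with whole-number counts over the stated sample sizes (30 online conversations, 45 curated bad cases). The short check below reconstructs those counts. The counts themselves are inferred from the percentages and are not stated in the abstract, and `implied_count` is a helper name introduced here for illustration.

```python
def implied_count(percent: float, total: int) -> int:
    """Return the whole-number count whose share of `total` rounds to `percent`."""
    count = round(percent / 100 * total)
    # Confirm the inferred count reproduces the reported two-decimal percentage.
    assert round(count / total * 100, 2) == percent
    return count


excellent = implied_count(66.67, 30)  # conversations with at least one excellent response
fixed = implied_count(93.33, 45)      # bad cases fixed
clean = implied_count(75.56, 45)      # bad cases fixed cleanly
print(excellent, fixed, clean)
```

Under this reading, the headline shares correspond to 20 of 30 conversations, 42 of 45 bad cases fixed, and 34 of 45 fixed cleanly, which makes the small size of the expert evaluation concrete.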

Significance. If the results hold after addressing the comparison and reproducibility issues, the work offers a practical demonstration of combining diverse reward signals (preference, LLM judgment, and rule-based) for LLM alignment in constrained, high-stakes persuasive dialogue. The inclusion of both expert consensus ratings and a large-scale production A/B test strengthens the practical relevance for real-world deployment of LLM agents that must balance flexibility with strict procedural and guardrail adherence.

major comments (2)

[Experiments / Production A/B Test] The production A/B test (described in the Experiments section) compares REPO to an intent-driven dialogue system rather than to GRPO. This design choice means the reported improvements in response rate (+12.14 pp) and task success rate (+5.94 pp) cannot be attributed specifically to REPO's integration of heterogeneous rewards (RM, RJ, RF) versus the general effect of deploying any LLM-based agent. A controlled production comparison holding the base LLM and SOP fixed while varying only the training method (REPO vs. GRPO) is required to support the central claim of transfer from expert evaluations to live outcomes.
[REPO Method] The REPO method description provides no information on how the three heterogeneous rewards are weighted or combined into the policy optimization objective. Since reward combination weights are free parameters (as noted in the reader's analysis), this omission prevents assessment of whether reward conflicts or hacking were avoided and undermines reproducibility of the reported gains over GRPO.
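
The controlled REPO vs. GRPO production comparison requested in the first major comment would typically be analyzed with a two-proportion z-test. The sketch below shows that test using only the standard library. The arm sizes and response counts are hypothetical placeholders, since the abstract reports neither per-arm sizes nor baseline rates, and the choice of a pooled z-test is an assumption rather than the authors' stated procedure.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference in proportions (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical counts for illustration only (not from the paper).
diff, z, p = two_proportion_z_test(success_a=2600, n_a=4800, success_b=2300, n_b=4800)
print(f"lift={diff * 100:.2f} pp, z={z:.2f}, p={p:.2g}")
```

Holding the base LLM and SOP fixed across both arms, as the referee asks, is what would let a lift measured this way be attributed to the training method rather than to deploying an LLM agent at all.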

minor comments (2)

[Expert Consensus Evaluation] The abstract and evaluation sections should clarify the exact procedure for reaching expert consensus (e.g., majority vote, discussion) and report inter-rater agreement metrics to strengthen the reliability of the +0.33 rating improvement and 66.67% excellent-response share claims.
[Method] Training details such as the number of optimization steps, learning rate schedule, and exact form of the combined reward in the REPO objective are missing; adding these would aid replication without altering the central claims.
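
The inter-rater agreement metric requested in the first minor comment could be Fleiss' kappa, which suits a fixed panel of three raters assigning categorical scores. A minimal sketch follows. The rating matrix is made up for illustration, and the choice of Fleiss' kappa is an assumption, since the manuscript does not name a metric.

```python
def fleiss_kappa(counts: list) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters giving item i category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # every item must have the same number of raters
    total = n_items * n_raters
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical ratings: 4 dialogues, 3 raters, categories (poor, good, excellent).
ratings = [[0, 0, 3], [0, 3, 0], [3, 0, 0], [0, 1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

Reporting a value like this alongside the consensus procedure would let readers judge how much of the +0.33 rating difference exceeds rater disagreement.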

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have made revisions to improve clarity and address the concerns where feasible.

Referee: [Experiments / Production A/B Test] The production A/B test (described in the Experiments section) compares REPO to an intent-driven dialogue system rather than to GRPO. This design choice means the reported improvements in response rate (+12.14 pp) and task success rate (+5.94 pp) cannot be attributed specifically to REPO's integration of heterogeneous rewards (RM, RJ, RF) versus the general effect of deploying any LLM-based agent. A controlled production comparison holding the base LLM and SOP fixed while varying only the training method (REPO vs. GRPO) is required to support the central claim of transfer from expert evaluations to live outcomes.

Authors: We appreciate this observation. The production A/B test evaluates the deployed REPO agent against the existing intent-driven production system to demonstrate practical improvements in real customer interactions on a large scale (9,653 conversations). The expert consensus evaluation provides the controlled comparison to GRPO. We acknowledge that a direct production A/B test between REPO and GRPO would more precisely isolate the contribution of heterogeneous reward integration. Operational constraints prevented running such a test during the study period. In the revised manuscript, we will add explicit discussion of this limitation and clarify the complementary roles of the expert evaluation and production results.
revision: partial
Referee: [REPO Method] The REPO method description provides no information on how the three heterogeneous rewards are weighted or combined into the policy optimization objective. Since reward combination weights are free parameters (as noted in the reader's analysis), this omission prevents assessment of whether reward conflicts or hacking were avoided and undermines reproducibility of the reported gains over GRPO.

Authors: We thank the referee for identifying this gap. The manuscript describes the combination of RM, RJ, and RF at a conceptual level but omits specifics on weighting and aggregation in the policy optimization objective. We will revise the REPO method section to detail the reward combination strategy, including the weights or scaling factors applied to each component, normalization procedures, and mechanisms to balance signals and mitigate conflicts or hacking risks. This will improve reproducibility and allow assessment of the approach.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains validated by external human ratings and production A/B test

The paper introduces REPO as an RL post-training method that aggregates three distinct reward sources (preference RM, LLM-as-judge RJ, rule-based RF) to optimize an LLM policy under SOP guardrails. All reported performance lifts (average dialogue rating 4.63, excellent-response share 66.67%, bad-case fix rate 93.33%, and production A/B improvements of +12.14 pp response rate and +5.94 pp task success) are obtained from separate human-expert consensus scoring on held-out conversations and from a live deployment test on 9,653 real customer interactions. No equations, fitted parameters, or self-citations are shown that would make these outcome metrics equivalent to quantities defined inside the reward-combination step itself; the evaluation protocol remains independent of the training loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the unstated premise that the three reward signals can be aggregated into a stable training objective and that improvements observed in curated and production settings are attributable to this aggregation rather than other training factors.

free parameters (1)

Reward combination weights
The method requires balancing the preference RM, LLM judge, and rule-based signals, but no values or tuning procedure are stated in the abstract.

axioms (1)

domain assumption: Heterogeneous rewards from preference models, LLM judges, and deterministic rules can be combined without introducing conflicts or reward hacking in multi-turn dialogue training.
This premise is required for REPO to produce the claimed alignment improvements.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

REPO ... combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) ... and rule-based reward functions (RF) ... $R_{\text{total}} = R_{\text{model}}\,(1 \pm E_{\text{enh}} / n)$ with clipping
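
The quoted formula scales the reward-model score by a factor of one plus or minus an enhancement term divided by n, then clips the result. The passage does not define the enhancement term, n, the sign rule, or the clipping bounds, so the sketch below rests on an assumed reading: `e_enh` counts how many of `n` judge and rule checks fired, `positive` selects the sign, and the clip range is a placeholder. None of these choices is confirmed by the source.

```python
def enhanced_reward(r_model: float, e_enh: int, n: int, positive: bool,
                    clip: tuple = (-1.0, 1.0)) -> float:
    """Compute R_total = R_model * (1 +/- E_enh / n), then clip.

    Assumed reading (not confirmed by the source): e_enh counts how many of
    the n judge/rule checks fired, and `positive` selects the sign.
    """
    if n <= 0 or not 0 <= e_enh <= n:
        raise ValueError("need n > 0 and 0 <= e_enh <= n")
    sign = 1.0 if positive else -1.0
    r_total = r_model * (1.0 + sign * e_enh / n)
    lo, hi = clip  # hypothetical clipping bounds
    return max(lo, min(hi, r_total))
```

Under this reading, `enhanced_reward(0.5, 1, 4, True)` scales 0.5 by 1.25, while a full penalty (`e_enh == n`, `positive=False`) zeroes the reward. A negative `r_model` is pushed further down by a "positive" enhancement, which is one reason the exact definition matters for the reproducibility concern raised in the referee report.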

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.