One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

Enguo Zhou; Jing Xiang; Lang Gao; Mingzhe Li; Qishen Zhang; Tai Li; Xiangliang Zhang; Xiuying Chen

arxiv: 2606.09293 · v1 · pith:KNQ4TAF7 · submitted 2026-06-08 · 💻 cs.CL

Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao, Tai Li, Qishen Zhang, Xiangliang Zhang, Xiuying Chen

Pith reviewed 2026-06-27 16:40 UTC · model grok-4.3

keywords multi-objective reinforcement learning · e-commerce dialogue systems · reasoning constraints · adaptive rewards · natural language generation · conversion optimization · user satisfaction

The pith

MORE treats reasoning accuracy as optimization constraints to jointly improve decision-making and natural responses in e-commerce dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MORE, an adaptive multi-objective reinforcement learning framework for dialogue systems that must reason correctly over user profiles such as eligibility while also producing natural and faithful responses. Direct mixing of rewards leads to oscillations, so the method enforces reasoning functions as constraints that guide policy updates and adds an adaptive mechanism to reweight linguistic signals like fluency based on gradient feedback. The trained policy then generates responses at inference time without any explicit reasoning steps or added computation. Production experiments on ByteDance traffic show gains in conversion metrics and user satisfaction alongside lower handoff rates to humans.

Core claim

MORE jointly optimizes reasoning accuracy and linguistic naturalness by treating reasoning functions as constraints that guide policy optimization rather than mixing them into a single reward, combined with an adaptive multi-reward mechanism that dynamically reweights signals such as fluency and naturalness via gradient feedback. The resulting system generates natural responses directly at inference time while retaining benefits from the reasoning-enhanced training scaffold.
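
The difference between mixing rewards and constraining on reasoning can be illustrated with a Lagrangian relaxation, a common way to handle constraints in policy optimization. This is a minimal sketch under assumptions: the source does not state how MORE enforces its constraints, so the Lagrangian form, the threshold, and the names `constrained_objective` and `update_multiplier` are hypothetical and not taken from the paper.

```python
def constrained_objective(linguistic_reward, reasoning_score, lam, threshold):
    """Lagrangian L = R_ling + lam * (reasoning_score - threshold).

    Hypothetical form: the policy maximizes the linguistic reward, and the
    multiplier `lam` penalizes reasoning scores that fall below `threshold`.
    """
    return linguistic_reward + lam * (reasoning_score - threshold)


def update_multiplier(lam, reasoning_score, threshold, step_size=0.1):
    """Dual ascent step on the multiplier.

    `lam` grows while the constraint is violated (score below threshold)
    and shrinks otherwise, but never drops below zero.
    """
    return max(0.0, lam + step_size * (threshold - reasoning_score))


if __name__ == "__main__":
    lam = 0.0
    # Illustrative reasoning scores over a few training steps (made up).
    for score in [0.5, 0.6, 0.8, 0.95, 0.97]:
        lam = update_multiplier(lam, score, threshold=0.9)
        print(round(lam, 4), round(constrained_objective(1.0, score, lam, 0.9), 4))
```

In this kind of scheme the weight on the reasoning term is not a fixed hand-tuned scalar: it rises only while the constraint is violated, which is one plausible reading of "constraints that guide policy optimization".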

What carries the argument

Reasoning functions enforced as constraints on policy optimization, together with an adaptive multi-reward aggregator that reweights linguistic signals based on gradient feedback.
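
One way to picture gradient-feedback reweighting of linguistic rewards is sketched below. It is an assumption, not the paper's update rule (the referee report notes that the explicit rule is missing): weights are moved toward the inverse of each reward's gradient norm so that no single signal dominates the update. The names `reweight` and `aggregate`, the momentum term, and the example values are hypothetical.

```python
def reweight(weights, grad_norms, momentum=0.9, eps=1e-8):
    """Move reward weights toward the normalized inverse gradient norms.

    Assumed heuristic: a reward whose gradient is currently large gets a
    smaller weight. `momentum` smooths the change between steps.
    """
    inverse = [1.0 / (g + eps) for g in grad_norms]
    total = sum(inverse)
    target = [v / total for v in inverse]
    mixed = [momentum * w + (1.0 - momentum) * t for w, t in zip(weights, target)]
    norm = sum(mixed)
    return [m / norm for m in mixed]  # weights always sum to 1


def aggregate(rewards, weights):
    """Weighted sum of per-objective rewards (e.g. fluency, naturalness)."""
    return sum(w * r for w, r in zip(weights, rewards))


if __name__ == "__main__":
    weights = [0.5, 0.5]      # fluency, naturalness (illustrative)
    grad_norms = [1.0, 3.0]   # made-up per-reward gradient norms
    for _ in range(3):
        weights = reweight(weights, grad_norms)
        print([round(w, 4) for w in weights], round(aggregate([0.8, 0.6], weights), 4))
```

The design choice in this sketch is to keep the weights on the probability simplex, so the scale of the aggregated reward stays comparable across training steps.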

If this is right

Improves overall conversion by 16.53 percent and reached conversion by 30.09 percent in 14-day online experiments on production traffic.
Increases user satisfaction and reduces handoff rates to human agents.
Recovers about 60 percent of the incremental conversion lift that human agents achieve over baselines.
Outperforms strong baselines on two real-world ByteDance dialogue systems and on the MultiWOZ 2.2 benchmark.
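
To make the arithmetic behind these bullets concrete, the sketch below computes a relative lift and the fraction of a human-agent lift that a model recovers. It assumes the reported percentages are relative lifts over a control arm, which the summary does not state. The function names and the example rates are hypothetical and are not figures from the paper.

```python
def relative_lift(control_rate, treatment_rate):
    """Relative lift of the treatment arm over the control arm."""
    return (treatment_rate - control_rate) / control_rate


def recovered_fraction(baseline_rate, model_rate, human_rate):
    """Share of the human-over-baseline lift that the model achieves."""
    return (model_rate - baseline_rate) / (human_rate - baseline_rate)


if __name__ == "__main__":
    # Made-up conversion rates, chosen only to illustrate the formulas.
    baseline, model, human = 0.10, 0.13, 0.15
    print(f"model lift over baseline: {relative_lift(baseline, model):.2%}")
    print(f"share of human lift recovered: {recovered_fraction(baseline, model, human):.2%}")
```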

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The constraint formulation could extend to other dialogue domains that require hidden reasoning during training but direct generation at test time.
Removing inference overhead positions the method for high-volume customer service where latency matters.
Gradient-based reweighting of multiple rewards may stabilize training when objectives conflict in non-e-commerce settings.

Load-bearing premise

Treating reasoning functions strictly as constraints prevents oscillations and unstable learning when objectives have diverging dynamics.

What would settle it

An ablation that mixes reasoning and linguistic rewards directly into one objective, then measures whether learning remains stable and gains remain comparable in the same 14-day production traffic, would test the constraint approach.
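
Such an ablation needs a way to quantify "oscillation" in a training curve. The sketch below shows two simple candidates computed from a logged reward series: the rate at which the curve changes direction and a rolling variance. Both metrics, the function names, and the example series are assumptions for illustration; the source does not say which metric the authors used.

```python
from statistics import pvariance


def sign_flip_rate(rewards):
    """Fraction of consecutive steps where the reward curve reverses direction."""
    diffs = [b - a for a, b in zip(rewards, rewards[1:])]
    pairs = list(zip(diffs, diffs[1:]))
    if not pairs:
        return 0.0
    flips = sum(1 for d0, d1 in pairs if d0 * d1 < 0)
    return flips / len(pairs)


def rolling_variance(rewards, window):
    """Population variance of the reward over each sliding window."""
    return [
        pvariance(rewards[i:i + window])
        for i in range(len(rewards) - window + 1)
    ]


if __name__ == "__main__":
    mixed = [0.2, 0.6, 0.3, 0.7, 0.25, 0.65]          # made-up oscillating curve
    constrained = [0.2, 0.3, 0.4, 0.45, 0.5, 0.55]    # made-up smooth curve
    print(sign_flip_rate(mixed), sign_flip_rate(constrained))
    print(max(rolling_variance(mixed, 3)), max(rolling_variance(constrained, 3)))
```

Comparing these numbers between a mixed-reward run and a constrained run on the same data would give the stability claim a measurable form.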

Figures

Figures reproduced from arXiv: 2606.09293 by Enguo Zhou, Jing Xiang, Lang Gao, Mingzhe Li, Qishen Zhang, Tai Li, Xiangliang Zhang, Xiuying Chen.

**Figure 3.** Our MORE framework. A reasoning-enhanced scaffold improves reasoning correctness, while adaptive multi-reward ... [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 5.** Comparison of old vs. new models in a loan inter... [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 7.** D Prompt Template. We provide the detailed prompt templates used for dialogue generation in our experiments in ... [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

**Figure 6.** An example of the structured input. Structured output example: "phone_last_digits": "9876", "id_tail": "321", "credit_limit": "1000 CNY", "fangxinjie_rate": "19.8%", "coupon_amount": "30000 CNY", "coupon_expiry": "June second", "interest_10k_1day": "5.42 CNY", "interest_10k_30days": "162.74 CNY", "interest_30k_1day": "16.27 CNY" [PITH_FULL_IMAGE:figures/full_fig_p011_6.png]

**Figure 7.** An example of the generated structured output. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png]

**Figure 8.** The detailed prompt template used for dialogue generation. [PITH_FULL_IMAGE:figures/full_fig_p012_8.png]

Original abstract

Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweighs them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports solid production lifts from constraining reasoning accuracy in multi-objective dialogue RL, but supplies no ablations or details on why that design choice beats mixing rewards.

The core result is that MORE, by treating reasoning functions as hard constraints rather than mixing them into the reward, improves conversion metrics by 16.53% overall and 30.09% on reached conversions in 14-day ByteDance traffic tests, while also lifting satisfaction and cutting handoffs. It recovers roughly 60% of the gap to human agents.

What stands out is the applied focus: real e-commerce traffic, two production systems, and the MultiWOZ benchmark. The adaptive gradient-driven reweighting of fluency and naturalness signals is a practical way to avoid hand-tuned scalars when linguistic objectives pull in different directions. Reporting online A/B results at all is better than most dialogue papers.

The weak point is the justification for the constraint formulation. The abstract states that mixing rewards produced oscillations in preliminary runs, yet those runs, the oscillation metric, and any direct comparison of the two approaches are missing. There are also no ablations on how the constraints are enforced during policy optimization, no statistical significance on the lifts, and no breakdown of which component drove the gains. Without that, the numbers are hard to attribute.

This paper is aimed at teams shipping production dialogue agents that must balance factual accuracy with conversational quality. A reader who wants new RL theory or fully reproducible experiments will find it thin. A practitioner who needs evidence that something worked at scale will get some value.

It deserves peer review once the authors add the missing ablations and implementation specifics on the constraint mechanism; the online results are the kind of data that warrants referee time even if the method section needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper proposes MORE, an adaptive multi-objective reinforcement learning framework for e-commerce dialogue systems. It jointly optimizes reasoning accuracy (treated as hard constraints) and linguistic naturalness (via an adaptive multi-reward aggregator that dynamically reweights signals such as fluency), avoiding direct reward mixing to prevent oscillations. At inference the model generates responses directly without explicit reasoning steps. Evaluations on ByteDance production systems and MultiWOZ 2.2 report consistent outperformance of baselines; 14-day online A/B tests show 16.53% and 30.09% lifts in overall and reached conversion, plus gains in satisfaction and reduced handoffs, recovering ~60% of the incremental lift from human agents.

Significance. If the production results and the constraint-based stability claim hold after proper validation, the work would offer a practical, deployable solution for multi-objective dialogue optimization in high-stakes e-commerce settings. The separation of reasoning constraints from linguistic rewards and the gradient-based adaptive aggregator address a real tension in production systems and could generalize beyond the reported domains.

major comments (3)

[Abstract and §3 (Method)] The central design decision to treat reasoning functions strictly as constraints (rather than mixing rewards) rests on the assertion that preliminary experiments showed oscillations and unstable learning, yet the manuscript supplies neither those experiments, an oscillation metric, nor any ablation comparing the two formulations across domains or reward scales. This is load-bearing for the claimed stability advantage.
[§5 (Online Experiments)] The 14-day production A/B test results (16.53% overall conversion lift, 30.09% reached-conversion lift) are presented without statistical significance tests, confidence intervals, traffic volume, randomization details, or pre-registered metric definitions. These omissions prevent verification that the reported gains are attributable to the proposed constraint mechanism versus other unstated factors.
[§4 (Adaptive Multi-Reward)] The adaptive reweighting mechanism is described at a high level via gradient feedback, but the manuscript lacks the explicit update rule or loss formulation (e.g., how weights are computed from per-objective gradients) and provides no sensitivity analysis or ablation on the reweighting hyperparameters.
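
The statistical reporting requested in the comment on §5 can be illustrated with a two-proportion z-test and a confidence interval for the difference in conversion rates. This is a minimal sketch using the normal approximation and only the Python standard library. The function names and the traffic counts are hypothetical, since the source gives no traffic volumes.

```python
from math import erf, sqrt


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    normal_cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return z, 2.0 * (1.0 - normal_cdf)


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Approximate 95% interval for the absolute difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


if __name__ == "__main__":
    # Made-up counts: control arm A versus treatment arm B.
    z, p_value = two_proportion_ztest(100, 1000, 150, 1000)
    low, high = diff_confidence_interval(100, 1000, 150, 1000)
    print(f"z={z:.2f}, p={p_value:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

Reporting the interval alongside the point estimate lets a reader judge whether a lift is distinguishable from noise at the available traffic volume.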

minor comments (2)

[§3] The inference-time claim that the model benefits from the reasoning scaffold without additional overhead would be clearer with a short diagram or pseudocode showing the training vs. inference pipelines.
[§5] Table or figure captions for the MultiWOZ results should explicitly state the evaluation metrics (e.g., success rate, BLEU) and whether human or automatic judgments were used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for strengthening the manuscript's claims on stability, experimental rigor, and methodological transparency. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract and §3 (Method)] The central design decision to treat reasoning functions strictly as constraints (rather than mixing rewards) rests on the assertion that preliminary experiments showed oscillations and unstable learning, yet the manuscript supplies neither those experiments, an oscillation metric, nor any ablation comparing the two formulations across domains or reward scales. This is load-bearing for the claimed stability advantage.

Authors: We agree that the preliminary experiments and supporting metrics are essential to substantiate the stability claim and should have been included. In the revision we will add a dedicated subsection (or appendix) presenting the oscillation metrics (e.g., reward variance and policy gradient norm over training steps), the exact experimental setup, and a direct ablation comparing reward mixing versus the constraint formulation on both the ByteDance production data and MultiWOZ 2.2. This will make the design decision fully verifiable.

Revision: yes

Referee: [§5 (Online Experiments)] The 14-day production A/B test results (16.53% overall conversion lift, 30.09% reached-conversion lift) are presented without statistical significance tests, confidence intervals, traffic volume, randomization details, or pre-registered metric definitions. These omissions prevent verification that the reported gains are attributable to the proposed constraint mechanism versus other unstated factors.

Authors: We acknowledge the need for greater statistical transparency. The revised §5 will report p-values from appropriate significance tests, 95% confidence intervals, approximate daily traffic volume (aggregated to respect privacy constraints), randomization procedure, and the pre-registered metric definitions. While exact per-day traffic counts cannot be disclosed for proprietary reasons, the added information will allow readers to assess the reliability and attribution of the reported lifts.

Revision: partial

Referee: [§4 (Adaptive Multi-Reward)] The adaptive reweighting mechanism is described at a high level via gradient feedback, but the manuscript lacks the explicit update rule or loss formulation (e.g., how weights are computed from per-objective gradients) and provides no sensitivity analysis or ablation on the reweighting hyperparameters.

Authors: We will expand §4 with the precise mathematical formulation, including the gradient-based weight update rule and the composite loss expression. The revision will also add a sensitivity analysis table and ablation experiments varying the key reweighting hyperparameters, demonstrating robustness across the reported domains.

Revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

The paper presents an empirical RL framework (MORE) motivated by preliminary experiments on reward mixing, with performance evaluated via production A/B tests and MultiWOZ. No mathematical derivations, equations, first-principles predictions, or load-bearing self-citations appear in the abstract or described content. The design choice to use constraints is justified by external (unshown) experiments rather than reducing to a self-referential definition or fitted input renamed as prediction. Central claims rest on external traffic data rather than any internal chain that collapses to inputs by construction. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters or axioms; the framework implicitly assumes that reasoning accuracy can be encoded as a hard constraint without degrading linguistic quality and that gradient feedback suffices to stabilize multi-reward aggregation.

pith-pipeline@v0.9.1-grok · 5804 in / 1301 out tokens · 17826 ms · 2026-06-27T16:40:15.612456+00:00 · methodology


[1] [1]

Thomas Back. 1994. Selective pressure in evolutionary algorithms: A charac- terization of selection mechanisms. InProceedings of the first IEEE conference on evolutionary computation. IEEE World Congress on Computational Intelligence. IEEE, 57–62

1994

[2] [2]

Namo Bang, Jeehyun Lee, and Myoung-Wan Koo. 2023. Task-Optimized Adapters for an End-to-End Task-Oriented Dialogue System. InFindings of the Association for Computational Linguistics: ACL 2023. 7355–7369

2023

[3] [3]

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ-A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5016–5026

2018

[4] [4]

Howard Chen, Huihan Li, Danqi Chen, and Karthik Narasimhan. 2022. Control- lable text generation with language constraints.arXiv preprint arXiv:2212.10466 (2022)

work page arXiv 2022

[5] [5]

Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2021. Survey on evaluation methods for dialogue systems.Artificial Intelligence Review54, 1 (2021), 755–810

2021

[6] [6]

Wenjie Dong, Sirong Chen, and Yan Yang. 2025. Protod: Proactive task-oriented dialogue system based on large language model. InProceedings of the 31st Inter- national Conference on Computational Linguistics. 9147–9164

2025

[7] [7]

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization.ICML (2024)

2024

[8] [8]

Yihao Feng, Shentao Yang, Shujian Zhang, Jianguo Zhang, Caiming Xiong, Mingyuan Zhou, and Huan Wang. 2023. Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems. InICLR

2023

[9] [9]

Nikolaus Hansen and Andreas Ostermeier. 2001. Completely derandomized self-adaptation in evolution strategies.Evolutionary computation9, 2 (2001), 159–195

2001

[10] [10]

Wanwei He, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, Min Yang, Fei Huang, Luo Si, et al. 2022. Galaxy: A generative pre- trained model for task-oriented dialog with semi-supervised learning and explicit policy injection. InProceedings of the AAAI conference on artificial intelligence, Vol. 36. 10749–10757

2022

[11] [11]

2010.Robust nonparametric statistical methods

Thomas P Hettmansperger and Joseph W McKean. 2010.Robust nonparametric statistical methods. CRC press

2010

[12] [12]

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue.Advances in Neural Information Processing Systems33 (2020), 20179–20191

2020

[13] [13]

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. 2019. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog.arXiv preprint arXiv:1907.00456(2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[14] [14]

Lingxiao Kong, Cong Yang, Susanne Neufang, Oya Deniz Beyan, and Zeyd Boukhers. 2025. EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning.arXiv preprint arXiv:2505.02579(2025)

work page arXiv 2025

[15] [15]

Wai-Chung Kwan, Hong-Ru Wang, Hui-Min Wang, and Kam-Fai Wong. 2023. A survey on recent advances and challenges in reinforcement learning methods for task-oriented dialogue policy learning.Machine Intelligence Research20, 3 (2023), 318–334

2023

[16] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. 2024. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. In International Conference on Machine Learning. PMLR, 26874–26901.

[18] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and William B Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 994–1003.

[19] Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. In EMNLP.

[21] Mingzhe Li, Xiuying Chen, Jing Xiang, Qishen Zhang, Changsheng Ma, Chenchen Dai, Jinxiong Chang, Zhongyi Liu, and Guannan Zhang. 2024. Multi-Intent Attribute-Aware Text Matching in Searching. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 360–368.

[22] Mingzhe Li, Jing Xiang, Qishen Zhang, Kaiyang Wan, and Xiuying Chen. 2025. Flipping knowledge distillation: Leveraging small models’ expertise to enhance LLMs in text matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 22218–22229.

[23] Ye Liu, Wolfgang Maier, Wolfgang Minker, and Stefan Ultes. 2021. Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues Using BERT. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021). 839–845.

[24] Do June Min, Veronica Perez-Rosas, Kenneth Resnicow, and Rada Mihalcea. 2024. Dynamic reward adjustment in multi-reward reinforcement learning for counselor reflection generation. COLING (2024).

[26] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.

[27] Dimple Patil. 2024. Artificial intelligence in retail and e-commerce: Enhancing customer experience through personalization, predictive analytics, and real-time engagement. Predictive Analytics, And Real-Time Engagement (November 26, 2024) (2024).

[28] Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2231–2240.

[29] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.

[30] Vijay Mallik Reddy and Lakshmi Nivas Nalla. 2024. Personalization in e-commerce marketing: leveraging big data for tailored consumer engagement. Revista de Inteligencia Artificial en Medicina 15, 1 (2024), 691–725.

[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).

[33] Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In NAACL.

[34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

[35] Weiyan Shi, Yu Li, Saurav Sahay, and Zhou Yu. 2021. Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration. In Findings of the Association for Computational Linguistics: EMNLP 2021. 3478–3492.

[36] Haoyu Song, Wei-Nan Zhang, Jingwen Hu, and Ting Liu. 2020. Generating persona consistent dialogues by exploiting natural language inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8878–8885.

[37] Zirui Song, Bin Yan, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, and Xiuying Chen. 2025. Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey. In Findings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Com...

[38] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021.

KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea. Mingzhe Li et al.

[39] Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In ACL.

[40] Haipeng Sun, Junwei Bao, Youzheng Wu, and Xiaodong He. 2023. Mars: Modeling Context & State Representations with Contrastive Learning for End-to-End Task-Oriented Dialog. In Findings of the Association for Computational Linguistics: ACL 2023.

[41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.

[42] Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, and He-Yan Huang. 2024. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2748–2763.

[43] Shiquan Yang, Rui Zhang, Sarah Erfani, and Jey Han Lau. 2022. An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4918–4935.

[44] Yunyi Yang, Yunhao Li, and Xiaojun Quan. 2021. Ubar: Towards fully end-to-end task-oriented dialog system with GPT-2. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14230–14238.

[45] Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to align language models with human feedback. Advances in Neural Information Processing Systems 36 (2023), 10935–10950.

[46] Changshuo Zhang, Sirui Chen, Xiao Zhang, Sunhao Dai, Weijie Yu, and Jun Xu. 2024. UOEP: User-Oriented Exploration Policy for Enhancing Long-Term User Experiences in Recommender Systems. arXiv preprint arXiv:2401.09034 (2024).

[48] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL.

[49] Xiaoying Zhang, Baolin Peng, Kun Li, Jingyan Zhou, and Helen Meng. 2023. SGP-TOD: Building Task Bots Effortlessly via Schema-Guided LLM Prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023. 13348–13369.

[50] Zheng Zhang, Ryuichi Takanobu, Qi Zhu, MinLie Huang, and XiaoYan Zhu. 2020. Recent advances and challenges in task-oriented dialog systems. Science China Technological Sciences 63, 10 (2020), 2011–2027.

A Limitations

While MORE demonstrates strong performance in balancing multiple dialogue objectives, several limitations remain. First, our evaluation is l...