HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation

Aman Vaibhav Jha; Mayank Anand; Sriparna Saha; Subham Raj

arxiv: 2604.10048 · v1 · submitted 2026-04-11 · 💻 cs.IR

Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3

keywords: conversational recommender systems · hierarchical preference learning · value-guided tree search · multi-dimensional quality · agentic reasoning · virtual tool operations · user alignment · recommendation optimization

The pith

HARPO uses hierarchical preference learning and value-guided tree search to optimize conversational recommendations for multi-dimensional user quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that conversational recommender systems fall short when they optimize only for proxies such as retrieval accuracy or response fluency. HARPO instead treats recommendation as a decision process that first breaks quality into four dimensions—relevance, diversity, predicted user satisfaction, and engagement—then learns context-specific weights for those dimensions. A value network scores entire reasoning paths according to the weighted quality prediction rather than task completion, and virtual tool operations plus multi-agent refinement keep the reasoning transferable across domains. If the approach holds, systems would produce suggestions that better match actual user preferences in live conversations instead of just scoring well on static benchmarks.

Core claim

HARPO integrates hierarchical preference learning that decomposes recommendation quality into interpretable dimensions (relevance, diversity, predicted user satisfaction, and engagement) and learns context-dependent weights over these dimensions; deliberative tree-search reasoning guided by a learned value network that evaluates candidate reasoning paths based on predicted recommendation quality rather than task completion; and domain-agnostic reasoning abstractions through Virtual Tool Operations and multi-agent refinement, enabling transferable recommendation reasoning across domains.
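The first component can be made concrete with a small sketch. This is an illustration only, assuming that the context-dependent weights are a softmax over a linear function of some context features; the names `context_weights`, `weight_matrix`, and the feature vector are hypothetical and are not taken from the paper.

```python
import math

# The four quality dimensions named in the abstract.
DIMENSIONS = ("relevance", "diversity", "satisfaction", "engagement")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_weights(context_features, weight_matrix):
    """Map context features to one weight per dimension (assumed linear + softmax).

    weight_matrix has one row per dimension; both arguments are hypothetical
    stand-ins for whatever the paper's preference model actually learns.
    """
    logits = [
        sum(w * x for w, x in zip(row, context_features))
        for row in weight_matrix
    ]
    return dict(zip(DIMENSIONS, softmax(logits)))


def quality(scores, weights):
    """Weighted sum of per-dimension scores."""
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


if __name__ == "__main__":
    # Hypothetical context: the second feature flags "user asked for something new".
    weights = context_weights([1.0, 1.0], [[0.5, 0.0], [0.0, 1.5], [0.5, 0.0], [0.0, 0.0]])
    scores = {"relevance": 0.8, "diversity": 0.4, "satisfaction": 0.6, "engagement": 0.2}
    print(weights, quality(scores, weights))
```

Because the weights are non-negative and sum to one, the combined score always lies between the lowest and highest per-dimension score, which keeps the decomposition easy to inspect.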

What carries the argument

A learned value network that scores reasoning paths according to predicted multi-dimensional recommendation quality, paired with context-dependent weights on the four quality dimensions and virtual tool operations for abstraction.
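A rough sketch of how a value function can steer search over reasoning paths follows. It uses plain beam search as a stand-in, since the abstract does not specify the paper's tree-search procedure; the operation names, the `GAIN` table, and the hand-written `value` function are hypothetical placeholders for Virtual Tool Operations and the learned value network.

```python
import heapq


def value_guided_search(root, expand, value, beam_width=2, depth=3):
    """Beam search over paths, keeping the paths that `value` ranks highest."""
    frontier = [(root,)]
    for _ in range(depth):
        candidates = [path + (step,) for path in frontier for step in expand(path)]
        if not candidates:
            break  # nothing left to expand; keep the current frontier
        frontier = heapq.nlargest(beam_width, candidates, key=value)
    return max(frontier, key=value)


# Hypothetical abstract operations, loosely in the spirit of virtual tools.
OPS = ("search", "filter", "rank")
# Hypothetical per-step quality gains; a real system would use a learned model.
GAIN = {"search": 0.5, "filter": 0.3, "rank": 0.4}


def expand(path):
    """Allow each operation at most once per path."""
    return [op for op in OPS if op not in path]


def value(path):
    """Toy value: gains are discounted by position, so early steps matter more."""
    return sum(GAIN[step] / i for i, step in enumerate(path[1:], start=1))


if __name__ == "__main__":
    print(value_guided_search("start", expand, value))
```

The point of the sketch is the scoring target: paths are ranked by a predicted quality score for the whole path, not by whether a task finished.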

If this is right

Consistent gains on recommendation-centric metrics across the ReDial, INSPIRED, and MUSE datasets.
Response quality remains competitive while recommendation alignment improves.
Virtual tool abstractions allow the same reasoning patterns to transfer across different recommendation domains.
Optimization targets end-to-end recommendation quality instead of intermediate goals such as retrieval accuracy or fluent generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar hierarchical weighting of quality dimensions could be applied to other interactive decision tasks where success has multiple conflicting criteria.
The value network's accuracy would need ongoing calibration as user populations or conversation lengths change.
Extending the tree-search depth or adding more quality dimensions could be tested directly on the same evaluation setup.

Load-bearing premise

That the four quality dimensions together with the value network's predictions actually reflect what real users prefer in live conversations rather than simply correlating with the chosen proxy metrics on the test datasets.

What would settle it

A live user study in which participants converse with both HARPO and baseline systems and directly rate satisfaction and alignment; if ratings show no improvement or favor the baselines, the claim that the method optimizes for user-aligned quality would be falsified.

Figures

Figures reproduced from arXiv: 2604.10048 by Aman Vaibhav Jha, Mayank Anand, Sriparna Saha, Subham Raj.

**Figure 2.** Overall architecture of the HARPO framework. The model integrates four components: STAR for structured agentic reasoning, CHARM for hierarchical preference optimization, BRIDGE for cross-domain transfer, and MAVEN for multi-agent refinement, all built on a shared language model backbone.

…ing expected recommendation quality: $\theta^{*} = \arg\max_{\theta} \mathbb{E}_{C,d}\left[ Q(r_t, C) \mid \pi_\theta \right]$ (1), where $v_t \in V^{*}$ is the predicted VTO sequen…

Original abstract

Conversational recommender systems (CRSs) operate under incremental preference revelation, requiring systems to make recommendation decisions under uncertainty. While recent approaches particularly those built on large language models achieve strong performance on standard proxy metrics such as Recall@K and BLEU, they often fail to deliver high-quality, user-aligned recommendations in practice. This gap arises because existing methods primarily optimize for intermediate objectives like retrieval accuracy, fluent generation, or tool invocation, rather than recommendation quality itself. We propose HARPO (Hierarchical Agentic Reasoning with Preference Optimization), an agentic framework that reframes conversational recommendation as a structured decision-making process explicitly optimized for multi-dimensional recommendation quality. HARPO integrates hierarchical preference learning that decomposes recommendation quality into interpretable dimensions (relevance, diversity, predicted user satisfaction, and engagement) and learns context-dependent weights over these dimensions; (ii) deliberative tree-search reasoning guided by a learned value network that evaluates candidate reasoning paths based on predicted recommendation quality rather than task completion; and (iii) domain-agnostic reasoning abstractions through Virtual Tool Operations and multi-agent refinement, enabling transferable recommendation reasoning across domains. We evaluate HARPO on ReDial, INSPIRED, and MUSE, demonstrating consistent improvements over strong baselines on recommendation-centric metrics while maintaining competitive response quality. These results highlight the importance of explicit, user-aligned quality optimization for conversational recommendation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HARPO gives conversational recommenders a hierarchical quality decomposition and value-guided tree search, but the user-alignment claim rests on proxy metrics without independent validation.

The core idea is to stop optimizing conversational recommenders for retrieval accuracy or fluency and instead decompose quality into relevance, diversity, predicted satisfaction, and engagement, then learn context-dependent weights over them. A value network scores entire reasoning paths during tree search, and virtual tool abstractions are meant to make the reasoning transferable across domains. That combination is not in the prior CRS work the abstract cites, and it directly targets the mismatch between standard metrics and actual recommendation quality that the field has complained about for years.

The multi-dataset evaluation on ReDial, INSPIRED, and MUSE, together with the claim of competitive response quality alongside improved recommendation metrics, is the part that could matter to practitioners. The framework is concrete enough that someone could re-implement the tree search and virtual tools without too much guesswork.

The soft spot is exactly the one the stress-test flags. The value network and dimension weights appear to be supervised on the same proxy signals used for final evaluation, with no reported human-in-the-loop studies or out-of-distribution user feedback to check whether the learned quality function actually tracks real user satisfaction. If the gains come mainly from better search rather than better alignment, the hierarchical preference learning part is not yet anchored. The abstract also gives no numbers, ablations, or statistical details, so the size of the improvement is still unclear.

This deserves a serious referee read from groups working on agentic dialogue or CRS who want a worked example of quality-guided search. It is not yet ready to cite as evidence that user-aligned recommendation is solved, but the architecture is worth testing and extending. I would bring it to a reading group to discuss the value-network training details.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces HARPO, an agentic framework for conversational recommender systems that reframes recommendation under incremental preference revelation as explicit multi-dimensional quality optimization. It integrates (i) hierarchical preference learning that decomposes quality into relevance, diversity, predicted user satisfaction, and engagement with learned context-dependent weights; (ii) deliberative tree-search reasoning guided by a value network that scores paths on predicted quality rather than task completion; and (iii) domain-agnostic abstractions via Virtual Tool Operations and multi-agent refinement. Evaluations on ReDial, INSPIRED, and MUSE are reported to yield consistent gains on recommendation-centric metrics while preserving response quality.

Significance. If the value network and dimension weights demonstrably optimize for genuine user alignment beyond proxy correlations, and if the gains are robustly isolated to the proposed components, the work could meaningfully shift CRS research toward direct quality optimization with interpretable, transferable reasoning. The emphasis on multi-agent refinement and virtual tools for cross-domain applicability is a constructive direction.

major comments (2)

§3.2 (Value Network): The claim that the value network guides reasoning toward user-aligned recommendation quality is load-bearing for the deliberative tree-search contribution. However, the training appears to rely on the same proxy signals (e.g., Recall@K) used in final evaluation on ReDial/INSPIRED/MUSE, without reported human-in-the-loop validation or out-of-distribution user feedback. This leaves open the possibility that observed gains arise from more sophisticated search rather than improved alignment.
§4 (Experimental Evaluation): The central empirical claim of 'consistent improvements over strong baselines on recommendation-centric metrics' across three datasets is not supported by any reported quantitative values, baseline specifications, statistical tests, confidence intervals, or ablations isolating the hierarchical weights and value network. Without these, the evidence cannot substantiate the superiority or the contribution of the proposed mechanisms.
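For readers less familiar with the proxy metrics at issue in these comments, the sketch below computes Recall@K and NDCG@K for a single ranked list. It uses the standard textbook definitions with binary relevance; it is not taken from the paper, and the item identifiers are made up for illustration.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


if __name__ == "__main__":
    ranked = ["movie_a", "movie_b", "movie_c", "movie_d"]  # hypothetical system output
    relevant = {"movie_b"}                                  # hypothetical ground truth
    print(recall_at_k(ranked, relevant, 2), ndcg_at_k(ranked, relevant, 2))
```

Both metrics reward placing the logged ground-truth item near the top of the list, which is why a value network trained on them could score well without tracking what a live user would prefer.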

minor comments (1)

Abstract: The enumerated list of contributions begins with an unlabeled first item and then uses '(ii)' for the second component, creating a minor numbering inconsistency.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and will revise the manuscript to improve clarity and substantiation where feasible.

Referee: §3.2 (Value Network): The claim that the value network guides reasoning toward user-aligned recommendation quality is load-bearing for the deliberative tree-search contribution. However, the training appears to rely on the same proxy signals (e.g., Recall@K) used in final evaluation on ReDial/INSPIRED/MUSE, without reported human-in-the-loop validation or out-of-distribution user feedback. This leaves open the possibility that observed gains arise from more sophisticated search rather than improved alignment.

Authors: We appreciate this observation on the value network. The network is trained to predict a composite quality score from the hierarchical preference model, which decomposes quality into relevance, diversity, predicted user satisfaction, and engagement with learned context-dependent weights; the objective is therefore to estimate path quality along these dimensions rather than task-completion proxies. Evaluation metrics such as Recall@K are used only for comparability with prior CRS work. We nevertheless acknowledge that the current training and evaluation lack human-in-the-loop validation or explicit OOD user feedback, leaving open the possibility that gains partly stem from more effective search. We will revise §3.2 to clarify the training objective and add a limitations subsection discussing this gap together with planned future user studies.
revision: partial
Referee: §4 (Experimental Evaluation): The central empirical claim of 'consistent improvements over strong baselines on recommendation-centric metrics' across three datasets is not supported by any reported quantitative values, baseline specifications, statistical tests, confidence intervals, or ablations isolating the hierarchical weights and value network. Without these, the evidence cannot substantiate the superiority or the contribution of the proposed mechanisms.

Authors: We agree that the experimental section requires substantially more detail to support the claims. In the revised manuscript we will expand §4 to report all quantitative results (specific Recall@K, NDCG@K, and other recommendation-centric scores) for HARPO and each baseline across ReDial, INSPIRED, and MUSE; we will fully specify baseline implementations and hyperparameters; we will add statistical significance tests (paired t-tests with p-values), 95% confidence intervals, and expanded ablation tables that isolate the hierarchical weighting and value-network components. These changes will make the evidence for the proposed mechanisms explicit and verifiable.
revision: yes
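The kind of paired significance check promised in this response can be sketched with the standard library alone. Note the substitution: an exact sign-flip permutation test stands in here for the paired t-test named in the rebuttal, because it needs no t-distribution tables. The per-conversation scores in the usage example are invented for illustration and are not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(system_scores, baseline_scores):
    """Exact two-sided sign-flip permutation test on paired score differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    Returns the share of assignments whose absolute mean difference is at
    least as large as the observed one.
    """
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [sign * d for sign, d in zip(signs, diffs)]
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Hypothetical per-conversation Recall@K values for two systems.
    system = [0.5, 0.6, 0.7, 0.8]
    baseline = [0.4, 0.4, 0.4, 0.4]
    print(paired_sign_flip_test(system, baseline))
```

With only four paired observations, the smallest achievable two-sided p-value is 2/16, which illustrates why per-dataset sample sizes need to be reported alongside any significance claim.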

standing simulated objections not resolved

Conducting new human-in-the-loop validation or out-of-distribution user studies for the value network, which were outside the scope of the original experiments and would require additional resources and participant recruitment.

Circularity Check

0 steps flagged

No circularity detected; claims rest on external dataset evaluations without self-referential reductions

The provided abstract and context describe HARPO as an agentic framework using hierarchical preference learning over dimensions like relevance and diversity, a value network for tree-search guidance, and virtual tool operations. No equations, derivations, or parameter-fitting steps are visible. The evaluation relies on standard external benchmarks (ReDial, INSPIRED, MUSE) with proxy metrics such as Recall@K, rather than any internal prediction that reduces by construction to fitted inputs or self-citations. The central claims about user-aligned optimization are presented as empirically tested improvements over baselines, with no load-bearing self-citation chains or ansatz smuggling that would create circularity. This is a normal non-finding for a framework paper whose value is assessed via independent dataset results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that recommendation quality decomposes cleanly into four fixed dimensions whose context-dependent weights can be learned to predict user alignment; no new physical entities are introduced.

free parameters (1)

context-dependent weights over quality dimensions
Learned weights for relevance, diversity, predicted satisfaction, and engagement that vary by conversation context.

axioms (1)

Domain assumption: Recommendation quality can be decomposed into the four interpretable dimensions of relevance, diversity, predicted user satisfaction, and engagement.
Invoked when defining the hierarchical preference learning component.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

[1] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards knowledge-based recommender dialog system. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 1803–1813.

[2] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 815–824.

[3] Tri Dao. 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR).

[4] Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189.

[5] Chongming Gao, Wenqiang Lei, Xiangnan He, Maarten de Rijke, and Tat-Seng Chua. 2021. Advances and challenges in conversational recommender systems: A survey. AI Open, 2:100–126.

[6] Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. 2020. INSPIRED: Toward sociable recommendation dialog systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 8142–8152.

[7] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations.

[8] Xu Huang et al. 2023. Recommender AI agent: Integrating large language models for interactive recommendations. In RecSys.

[9] Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2021. A survey on conversational recommender systems. ACM Computing Surveys, 54(5):1–36.

[10] Walid Krichene and Steffen Rendle. 2022. On sampled metrics for item recommendation. Commun. ACM, 65(7):75–83. https://doi.org/10.1145/3535335

[11] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR.

[12] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In Advances in Neural Information Processing Systems, volume 31.

[13] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744.

[14] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2024. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In Proceedings of the International Conference on Learning Representations.

[15] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36.

[16] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113.

[17] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36.

[18] Yueming Sun and Yi Zhang. 2018. Conversational recommender system. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 235–244.

[19] Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[20] Shuokai Wang, Yucheng Cai, Longping Huang, Luoyi Fang, and Xiaowei Chang. 2022a. BARCOR: Towards a unified framework for conversational recommendation with pretrained language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11703–11713.

[21] Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2022b. Towards unified conversational recommender systems via knowledge-enhanced prompt learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1929–1937.

[22] Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2023. Knowledge-enhanced conversational recommendation via retrieval-augmented generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 1–15.

[23] Yancheng Wang et al. 2024. RecMind: Large language model powered agent for recommendation. In NAACL.

[24] Zihan Wang, Xiaocui Yang, Yongkang Liu, Shi Feng, Daling Wang, and Yifei Zhang. 2025. MUSE: A multimodal conversational recommendation dataset with scenario-grounded user profiles. In Findings of the Association for Computational Linguistics: ACL 2025, pages 1027–1053.

[25] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837.

[26] Jianing Yang, Jiaqi Liu, Zongxin Wang, and Guoyu Chen. 2023. Multi-modal semantic graph for conversational recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4655–4663.

[27] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36.

[28] Kun Zhou, Wayne Xin Zhao, Shuqing Bian, Yuanhang Zhou, Ji-Rong Wen, and Jingsong Yu. 2020. Improving conversational recommender systems via knowledge graph based semantic fusion. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1006–1014.
