pith. machine review for the scientific record.

arxiv: 2605.12995 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords policy optimization · candidate generation · ranking · large language models · reinforcement learning · sequential recommendation · multi-hop question answering

The pith

F-GRPO lets one LLM jointly generate candidates and rank them by factorizing policy optimization into separate phases with distinct advantages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the credit assignment problem that arises when an LLM must produce an ordered list of candidates in a single autoregressive pass yet receives only a final utility score. Standard group-relative policy optimization cannot tell whether poor results stem from missing relevant items or from ordering them badly. F-GRPO factorizes the policy into a generation phase and a ranking phase, applies an order-invariant coverage reward to the first and a position-aware utility reward to the second, and supplies each phase with its own group-relative advantage inside a shared two-phase sequence objective. The resulting unified training improves top-ranked performance over both GRPO and decoupled baselines while remaining competitive with strong zero-shot rerankers and requiring no architectural changes at inference time.
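
The abstract names the two reward signals but not their formulas. As a purely illustrative sketch, here is one plausible instantiation in Python, assuming set-recall coverage and DCG-style position weighting; the function names and exact formulas are editorial assumptions, not taken from the paper.

```python
import math

def coverage_reward(candidates, relevant):
    # Order-invariant: counts which relevant items made it into the slate,
    # regardless of where they appear.
    hits = set(candidates) & set(relevant)
    return len(hits) / max(len(relevant), 1)

def utility_reward(ranked, relevant):
    # Position-aware: DCG-style weighting, so a relevant item at rank 0
    # contributes weight 1.0 and later ranks contribute progressively less.
    rel = set(relevant)
    gain = sum(1.0 / math.log2(rank + 2)
               for rank, item in enumerate(ranked) if item in rel)
    ideal = sum(1.0 / math.log2(r + 2)
                for r in range(min(len(rel), len(ranked))))
    return gain / ideal if ideal > 0 else 0.0
```

Shuffling the ranked list leaves `coverage_reward` unchanged while `utility_reward` moves, which is exactly the separation the factorized objective exploits.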

Core claim

By factorizing the policy into candidate generation and ranking while sharing a single LLM backbone, and by applying separate group-relative advantages to each phase inside a two-phase sequence-level objective, the model can optimize both the selection of relevant candidates and their correct ordering against downstream utility signals in a single end-to-end rollout.

What carries the argument

Factorized Group-Relative Policy Optimization (F-GRPO), which decomposes the sequence-level objective into generation and ranking phases, supplies each with its own group-relative advantage, and combines an order-invariant coverage reward with a position-aware utility reward.
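
Group-relative advantages normalize each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of the phase-specific version described above, assuming the standard GRPO normalization (subtract the group mean, divide by the group standard deviation); the reward values are hypothetical:

```python
import torch

def group_relative_advantage(rewards, eps=1e-6):
    # GRPO-style normalization over a group of G rollouts of one prompt.
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Each of G = 4 rollouts is scored twice: an order-invariant coverage
# reward for the generation phase and a position-aware utility reward
# for the ranking phase.
coverage = [0.6, 0.8, 0.4, 0.8]
utility = [0.5, 0.3, 0.4, 0.7]

adv_gen = group_relative_advantage(coverage)   # credited to generation tokens
adv_rank = group_relative_advantage(utility)   # credited to ranking tokens
```

A rollout can be above the group average at assembling the slate yet below it at ordering, and the two advantages register this independently; a single sequence-level advantage cannot express that distinction.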

If this is right

  • Top-ranked performance rises over both standard GRPO and separately trained generation-then-ranking pipelines on sequential recommendation and multi-hop QA tasks.
  • The method outperforms supervised fine-tuning baselines while staying competitive with strong zero-shot rerankers.
  • No changes to model architecture or inference procedure are required after training.
  • End-to-end optimization aligns generation and ranking directly with final utility rather than with intermediate retrieval metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase-factorization pattern could apply to other composite generative tasks such as multi-step planning followed by execution.
  • Because the backbone remains unchanged, the approach may scale to larger models without doubling parameter count at deployment.
  • If the two-phase objective proves stable, similar factorization might reduce the need for hand-crafted multi-stage pipelines in retrieval-augmented generation.

Load-bearing premise

The credit assignment problem between missing good candidates and mis-ordering them can be solved simply by giving each phase its own group-relative advantage while keeping the LLM backbone shared.

What would settle it

A controlled experiment in which the unified model is forced to generate the same high-coverage candidate set as a strong decoupled baseline yet still produces lower final utility after ranking, or the reverse case where ranking quality stays fixed but generation quality drops.

Figures

Figures reproduced from arXiv: 2605.12995 by Bowen Jin, Gagan Mundada, Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Ritwik Sinha, Rohan Surana, Sizhe Zhou, Tong Yu, Xintong Li, Yizhu Jiao.

Figure 1: (a) Black-box ranking conflates candidate selection and ordering, yielding ambiguous credit assignment. (b) Factorized in-context generation and ranking with phase-specific goals within a single autoregressive rollout. We make the list-to-rank decision explicit within a single LLM rollout through in-context exploration. The model first constructs a candidate slate and then ranks that slate within the sa… view at source ↗
Figure 2: Training dynamics on LastFM. (a) Slate reward ablation. (b) The slate generator… view at source ↗
Figure 3: Precision–recall redistribution between the slate and ranker on LastFM across two… view at source ↗
Figure 4: Hyperparameter sensitivity for F-GRPO. (a)… view at source ↗
Figure 5: Evaluation metrics during training for F-GRPO and GRPO on Qwen3-4B. F-GRPO… view at source ↗
read the original abstract

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces F-GRPO, a factorized extension of group-relative policy optimization that unifies candidate generation and ranking inside a single autoregressive LLM rollout. It factorizes the policy into generation and ranking components that share one backbone, optimizes them jointly with an order-invariant coverage reward and a position-aware utility reward, and resolves the resulting credit-assignment problem by applying separate group-relative advantages inside a two-phase sequence-level objective. Experiments on sequential recommendation and multi-hop QA benchmarks report that F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers without any inference-time architectural changes.

Significance. If the factorization and phase-specific advantages can be shown to remain unbiased under the shared autoregressive coupling, the method would offer a practical route to end-to-end optimization of generative ranking pipelines that currently rely on staged retrieval-plus-reranking. The absence of inference-time overhead and the reported gains over both GRPO and supervised baselines would make the framework relevant to recommendation and retrieval-augmented generation systems.

major comments (2)
  1. [Method / two-phase objective] The central technical claim—that separate group-relative advantages for the generation and ranking phases remain unbiased inside a single contiguous autoregressive sequence—requires an explicit derivation or proof. The abstract states that the two-phase objective addresses the credit-assignment gap, yet no argument is supplied showing that gradients from the position-aware utility reward do not leak back through the shared backbone into generation decisions when the phase boundary is realized only by formatting conventions or token masking.
  2. [Experiments] The experimental section must include an ablation that isolates the effect of the factorization itself (i.e., F-GRPO versus a non-factorized GRPO baseline that uses the same two-phase formatting). Without this ablation, it is impossible to attribute the reported gains to the proposed advantage separation rather than to the joint training or the choice of rewards.
minor comments (2)
  1. [Experiments] The abstract refers to “sequential recommendation and multi-hop question answering benchmarks” without naming the concrete datasets or reporting the number of runs and statistical significance tests; these details should be added to the experimental section.
  2. [Method] Notation for the generation-phase and ranking-phase advantages should be introduced with explicit equations rather than descriptive text, to make the two-phase objective reproducible.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We appreciate the recognition of F-GRPO's potential for end-to-end optimization of generative ranking pipelines without inference-time overhead. We address each major comment below and will revise the manuscript to strengthen the technical presentation and experimental validation.

read point-by-point responses
  1. Referee: [Method / two-phase objective] The central technical claim—that separate group-relative advantages for the generation and ranking phases remain unbiased inside a single contiguous autoregressive sequence—requires an explicit derivation or proof. The abstract states that the two-phase objective addresses the credit-assignment gap, yet no argument is supplied showing that gradients from the position-aware utility reward do not leak back through the shared backbone into generation decisions when the phase boundary is realized only by formatting conventions or token masking.

    Authors: We acknowledge that the manuscript would benefit from an explicit derivation showing that the phase-specific group-relative advantages remain unbiased under the shared autoregressive backbone. In the revised version we will add a dedicated subsection deriving the policy gradient for the two-phase objective. The derivation will demonstrate that, by computing separate advantages over masked phase-specific tokens and applying the group-relative baseline within each phase, gradients from the position-aware utility reward are isolated to the ranking tokens and do not propagate back to generation decisions. revision: yes
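
To make the proposal concrete, here is an editorial sketch of what "separate advantages over masked phase-specific tokens" could look like as a policy-gradient surrogate; this is a reconstruction from the rebuttal's wording, not the authors' code, and the mask names are hypothetical:

```python
import torch

def two_phase_pg_loss(logprobs, gen_mask, rank_mask, adv_gen, adv_rank):
    # REINFORCE-style surrogate where each phase's advantage weights only
    # that phase's tokens.
    #   logprobs:  (B, T) per-token log-probabilities of sampled rollouts
    #   gen_mask:  (B, T) 1.0 on candidate-generation tokens, else 0.0
    #   rank_mask: (B, T) 1.0 on ranking tokens, else 0.0 (disjoint masks)
    #   adv_gen, adv_rank: (B,) phase-specific group-relative advantages
    weights = gen_mask * adv_gen[:, None] + rank_mask * adv_rank[:, None]
    n_tokens = (gen_mask + rank_mask).sum().clamp(min=1.0)
    return -(weights * logprobs).sum() / n_tokens
```

The masking keeps the utility advantage off the generation tokens in the loss itself, but updates driven by ranking tokens still move the shared backbone weights that generation uses; that indirect coupling is what the requested derivation would have to bound.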

  2. Referee: [Experiments] The experimental section must include an ablation that isolates the effect of the factorization itself (i.e., F-GRPO versus a non-factorized GRPO baseline that uses the same two-phase formatting). Without this ablation, it is impossible to attribute the reported gains to the proposed advantage separation rather than to the joint training or the choice of rewards.

    Authors: We agree that an ablation isolating the factorization and advantage separation is necessary. We will add this experiment in the revised manuscript: a non-factorized GRPO baseline that uses identical two-phase formatting and the same order-invariant coverage plus position-aware utility rewards, but applies a single group-relative advantage over the entire sequence. Performance differences versus F-GRPO will be reported on both the sequential recommendation and multi-hop QA benchmarks to attribute gains specifically to the phase-specific advantages. revision: yes
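
For concreteness, the promised baseline differs from F-GRPO in a single step: both rewards collapse into one scalar that is normalized once and applied uniformly to every token. A minimal self-contained sketch of the contrast, with illustrative numbers:

```python
import torch

def group_norm(rewards, eps=1e-6):
    # Group-relative normalization, as in GRPO.
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

coverage = [0.6, 0.8, 0.4, 0.8]   # hypothetical per-rollout rewards
utility = [0.5, 0.3, 0.4, 0.7]

# Non-factorized baseline: one advantage per rollout, applied to all tokens.
adv_all = group_norm([c + u for c, u in zip(coverage, utility)])

# F-GRPO: two advantages per rollout, each masked to its own phase's tokens.
adv_gen, adv_rank = group_norm(coverage), group_norm(utility)
```

Since both conditions share the formatting and the reward definitions, any performance gap isolates the effect of the advantage separation.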

Circularity Check

0 steps flagged

No significant circularity; F-GRPO defined as new factorized objective from credit-assignment setup

full rationale

The paper introduces F-GRPO as an explicitly new optimization framework that factorizes a single autoregressive policy into generation and ranking phases, using separate group-relative advantages inside a two-phase sequence-level objective together with order-invariant coverage and position-aware utility rewards. The construction is motivated directly by the credit-assignment gap stated in the abstract, and nothing in the abstract reduces the claimed separation or the performance gains to fitted inputs or prior results by definition. The central claim therefore stands as an independent modeling choice rather than a renaming or a self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach rests on standard RL concepts (group-relative advantages, sequence-level rewards) and the assumption that factorization resolves credit assignment.

pith-pipeline@v0.9.0 · 5639 in / 1064 out tokens · 52402 ms · 2026-05-14T19:48:31.789661+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 6 internal anchors

  1. [1]

    The million song dataset

    Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011

  2. [2]

    Autoregressive search engines: Generating substrings as document identifiers

    Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni. Autoregressive search engines: Generating substrings as document identifiers. Advances in Neural Information Processing Systems, 35:31668--31683, 2022

  3. [3]

    Generative slate recommendation with reinforcement learning

    Romain Deffayet, Thibaut Thonet, Jean-Michel Renders, and Maarten de Rijke. Generative slate recommendation with reinforcement learning. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM '23, pp. 580–588, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394079. doi:10.1145/3539597.35...

  4. [4]

    Policy-gradient training of language models for ranking

    Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, and Thorsten Joachims. Policy-gradient training of language models for ranking, 2024. URL https://arxiv.org/abs/2310.04407

  5. [5]

    Re2G: Retrieve, rerank, generate

    Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. Re2G: Retrieve, rerank, generate. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...

  6. [6]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633--638, 2025

  7. [7]

    Towards two-stage counterfactual learning to rank

    Shashank Gupta, Yiming Liao, and Maarten de Rijke. Towards two-stage counterfactual learning to rank. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), ICTIR '25, pp. 177–182, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400718618. doi:10.1145/3731...

  8. [8]

    The MovieLens datasets: History and context

    F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), December 2015. ISSN 2160-6455. doi:10.1145/2827872. URL https://doi.org/10.1145/2827872

  9. [9]

    Session-based Recommendations with Recurrent Neural Networks

    Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks, 2016. URL https://arxiv.org/abs/1511.06939

  10. [10]

    Towards universal sequence representation learning for recommender systems

    Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, pp. 585–593, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393850. doi:10...

  11. [11]

    Large language models are zero-shot rankers for recommender systems

    Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. Large language models are zero-shot rankers for recommender systems. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part II, pp. 364–381, Berlin, Heidelberg, 202...

  12. [12]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025a. URL https://openreview.net/forum?id=NFM8F5cV0V

  13. [13]

    Interactive visualization recommendation with hier-sucb

    Songwen Hu, Ryan A Rossi, Tong Yu, Junda Wu, Handong Zhao, Sungchul Kim, and Shuai Li. Interactive visualization recommendation with hier-sucb. In Proceedings of the ACM on Web Conference 2025, pp. 313--321, 2025b

  14. [14]

    A survey of foundation model-powered recommender systems: From feature-based, generative to agentic paradigms

    Chengkai Huang, Hongtao Huang, Tong Yu, Kaige Xie, Junda Wu, Shuai Zhang, Julian McAuley, Dietmar Jannach, and Lina Yao. A survey of foundation model-powered recommender systems: From feature-based, generative to agentic paradigms. arXiv preprint arXiv:2504.16420, 2025a

  15. [15]

    Towards agentic recommender systems in the era of multimodal large language models

    Chengkai Huang, Junda Wu, Yu Xia, Zixu Yu, Ruhan Wang, Tong Yu, Ruiyi Zhang, Ryan A Rossi, Branislav Kveton, Dongruo Zhou, et al. Towards agentic recommender systems in the era of multimodal large language models. arXiv preprint arXiv:2503.16734, 2025b

  16. [16]

    Pluralistic off-policy evaluation and alignment

    Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, and Lina Yao. Pluralistic off-policy evaluation and alignment. arXiv preprint arXiv:2509.19333, 2025c

  17. [17]

    Listwise preference diffusion optimization for user behavior trajectories prediction

    Hongtao Huang, Chengkai Huang, Junda Wu, Tong Yu, Julian McAuley, and Lina Yao. Listwise preference diffusion optimization for user behavior trajectories prediction. Advances in Neural Information Processing Systems, 38:159383--159408, 2026a

  18. [18]

    Image difference captioning via adversarial preference optimization

    Zihan Huang, Junda Wu, Rohan Surana, Tong Yu, David Arbour, Ritwik Sinha, and Julian McAuley. Image difference captioning via adversarial preference optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33746--33758, 2025d

  19. [19]

    Evaluation on entity matching in recommender systems

    Zihan Huang, Rohan Surana, Zhouhang Xie, Junda Wu, Yu Xia, and Julian McAuley. Evaluation on entity matching in recommender systems. arXiv preprint arXiv:2601.17218, 2026b

  20. [20]

    Active learning for direct preference optimization

    Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, and Tong Yu. Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076, 2025

  21. [21]

    A personalized conversational benchmark: Towards simulating personalized conversations

    Li Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, et al. A personalized conversational benchmark: Towards simulating personalized conversations. arXiv preprint arXiv:2505.14106, 2025a

  22. [22]

    Importance sampling for multi-negative multimodal direct preference optimization

    Xintong Li, Chuhan Wang, Junda Wu, Rohan Surana, Tong Yu, Julian McAuley, and Jingbo Shang. Importance sampling for multi-negative multimodal direct preference optimization. arXiv preprint arXiv:2509.25717, 2025b

  23. [23]

    Ract: Ranking-aware chain-of-thought optimization for llms

    Haowei Liu, Xuyang Wu, Guohao Sun, Hsin-Tai Wu, Zhiqiang Tao, and Yi Fang. Ract: Ranking-aware chain-of-thought optimization for llms. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP 2025, pp. 178–188, New York, NY, USA, 2025a. Association for...

  24. [24]

    Learning to rank for information retrieval

    Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225--331, 2009

  25. [25]

    Understanding r1-zero-like training: A critical perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025b. URL https://openreview.net/forum?id=5PAF7PAY2Y

  26. [26]

    Recranker: Instruction tuning large language model as ranker for top-k recommendation

    Sichun Luo, Bowei He, Haohan Zhao, Wei Shao, Yanlin Qi, Yinya Huang, Aojun Zhou, Yuxuan Yao, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. Recranker: Instruction tuning large language model as ranker for top-k recommendation. ACM Trans. Inf. Syst., 43(5), July 2025. ISSN 1046-8188. doi:10.1145/3705728. URL https://doi.org/10.1145/3705728

  27. [27]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention, 2025. URL https://arxiv.org/abs/2506.13585

  28. [28]

    Ws-grpo: Weakly-supervised group-relative policy optimization for rollout-efficient reasoning

    Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, et al. Ws-grpo: Weakly-supervised group-relative policy optimization for rollout-efficient reasoning. arXiv preprint arXiv:2602.17025, 2026

  29. [29]

    Large language models for conversational user simulation: A comprehensive survey

    Bo Ni, Leyao Wang, Yu Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Leura, Samyadeep Basu, Subhojyoti Mukherjee, et al. Large language models for conversational user simulation: A comprehensive survey. 2025

  30. [30]

    A survey on llm-based conversational user simulation

    Bo Ni, Yu Wang, Leyao Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Luera, Samyadeep Basu, Subhojyoti Mukherjee, et al. A survey on llm-based conversational user simulation. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4266--4301, 2026

  31. [31]

    Document ranking with a pretrained sequence-to-sequence model

    Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. Document ranking with a pretrained sequence-to-sequence model. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 708--718, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-em...

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744, 2022

  33. [33]

    Higr: Efficient generative slate recommendation via hierarchical planning and multi-objective preference alignment

    Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Hongyong Yu, Chengxiang Zhuo, and Zang Li. Higr: Efficient generative slate recommendation via hierarchical planning and multi-objective preference alignment, 2026. URL https://arxiv.org/abs/2512.24787

  34. [34]

    The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models

    Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, 2021. URL https://arxiv.org/abs/2101.05667

  35. [35]

    RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze!, 2023. URL https://arxiv.org/abs/2312.02724

  36. [36]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  37. [37]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc

  38. [38]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009. ISSN 1554-0669. doi:10.1561/1500000019. URL https://doi.org/10.1561/1500000019

  39. [39]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

  40. [40]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  41. [41]

    Rankllm: A python package for reranking with llms

    Sahel Sharifymoghaddam, Ronak Pradeep, Andre Slavescu, Ryan Nguyen, Andrew Xu, Zijian Chen, Yilin Zhang, Yidi Chen, Jasper Xian, and Jimmy Lin. Rankllm: A python package for reranking with llms. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '25, pp. 3681–3690, New York, NY, USA, ...

  42. [42]

    Is ChatGPT good at search? Investigating large language models as re-ranking agents

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14918--14937, Si...

  43. [43]

    From reviews to dialogues: Active synthesis for zero-shot llm-based conversational recommender system

    Rohan Surana, Junda Wu, Zhouhang Xie, Yu Xia, Harald Steck, Dawen Liang, Nathan Kallus, and Julian McAuley. From reviews to dialogues: Active synthesis for zero-shot llm-based conversational recommender system. arXiv preprint arXiv:2504.15476, 2025

  44. [44]

    Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, and Julian McAuley. Generate, filter, control, replay: A comprehensive survey of rollout s...

  45. [45]

    Maximum likelihood reinforcement learning

    Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning, 2026. URL https://arxiv.org/abs/2602.02710

  46. [46]

    Scaling down, litting up: Efficient zero-shot listwise reranking with seq2seq encoder-decoder models

    Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. Scaling down, litting up: Efficient zero-shot listwise reranking with seq2seq encoder-decoder models, 2023. URL https://arxiv.org/abs/2312.16098

  47. [47]

    Listwise generative retrieval models via a sequential learning process

    Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, and Xueqi Cheng. Listwise generative retrieval models via a sequential learning process. ACM Trans. Inf. Syst., 42(5), April 2024. ISSN 1046-8188. doi:10.1145/3653712. URL https://doi.org/10.1145/3653712

  48. [48]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  49. [49]

    MuSiQue: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539--554, 2022. doi:10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/

  50. [50]

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. ...

  51. [51]

    Scenealign: Aligning multimodal reasoning to scene graphs in complex visual scenes

    Chuhan Wang, Xintong Li, Jennifer Yuntong Zhang, Junda Wu, Chengkai Huang, Lina Yao, Julian McAuley, and Jingbo Shang. Scenealign: Aligning multimodal reasoning to scene graphs in complex visual scenes. arXiv preprint arXiv:2601.05600, 2026

  52. [52]

    Zero-shot next-item recommendation using large pretrained language models

    Lei Wang and Ee-Peng Lim. Zero-shot next-item recommendation using large pretrained language models, 2023. URL https://arxiv.org/abs/2304.03153

  53. [53]

    A neural corpus indexer for document retrieval

    Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems, 35:25600--25614, 2022

  54. [54]

    Ctrls: Chain-of-thought reasoning via latent state-transition

    Junda Wu, Yuxin Xiong, Xintong Li, Sheldon Yu, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, and Julian McAuley. Ctrls: Chain-of-thought reasoning via latent state-transition. In The 29th International Conference on Artificial Intelligence and Statistics

  55. [55]

    Deconfounded and explainable interactive vision-language retrieval of complex scenes

    Junda Wu, Tong Yu, and Shuai Li. Deconfounded and explainable interactive vision-language retrieval of complex scenes. MM '21, pp. 2103–2111, New York, NY, USA, 2021a. Association for Computing Machinery. ISBN 9781450386517. doi:10.1145/3474085.3475366. URL https://doi.org/10.1145/3474085.3475366

  56. [56]

    Clustering of conversational bandits for user preference learning and elicitation

    Junda Wu, Canzhe Zhao, Tong Yu, Jingyang Li, and Shuai Li. Clustering of conversational bandits for user preference learning and elicitation. CIKM '21, pp. 2129–2139, New York, NY, USA, 2021b. Association for Computing Machinery. ISBN 9781450384469. doi:10.1145/3459637.3482328. URL https://doi.org/10.1145/3459637.3482328

  57. [57]

    Dynamics-aware adaptation for reinforcement learning based cross-domain interactive recommendation

    Junda Wu, Zhihui Xie, Tong Yu, Handong Zhao, Ruiyi Zhang, and Shuai Li. Dynamics-aware adaptation for reinforcement learning based cross-domain interactive recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pp. 290–300, New York, NY, USA, 2022. Association for Com...

  58. [58]

    Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation

    Junda Wu, Cheng-Chun Chang, Tong Yu, Zhankui He, Jianing Wang, Yupeng Hou, and Julian McAuley. Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 3391--3401, 2024a

  59. [59]

    Personalized multimodal large language models: A survey

    Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A Rossi, Franck Dernoncourt, et al. Personalized multimodal large language models: A survey. arXiv preprint arXiv:2412.02142, 2024b

  60. [60]

    Decot: Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention

    Junda Wu, Tong Yu, Xiang Chen, Haoliang Wang, Ryan Rossi, Sungchul Kim, Anup Rao, and Julian McAuley. Decot: Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14073--14087, 2024c

  61. [61]

    Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation

    Junda Wu, Warren Li, Zachary Novack, Amit Namburi, Carol Chen, and Julian McAuley. Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5. IEEE, 2025a

  62. [62]

    Ocean: Offline chain-of-thought evaluation and alignment in large language models

    Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, et al. Ocean: Offline chain-of-thought evaluation and alignment in large language models. In International Conference on Learning Representations, volume 2025, pp. 100570--100589, 2025b

  63. [63]

    In-context ranking preference optimization

    Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A. Rossi, Prithviraj Ammanabrolu, and Julian McAuley. In-context ranking preference optimization. In Second Conference on Language Modeling, 2025c. URL https://openreview.net/forum?id=L2NPhLAKEd

  64. [64]

    Doc-react: Multi-page heterogeneous document question-answering

    Junda Wu, Yu Xia, Tong Yu, Xiang Chen, Sai Sree Harsha, Akash V Maharaj, Ruiyi Zhang, Victor Bursztyn, Sungchul Kim, Ryan A Rossi, et al. Doc-react: Multi-page heterogeneous document question-answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 67--78, 2025d

  65. [65]

    Sand: Boosting llm agents with self-taught action deliberation

    Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A Rossi, Lina Yao, and Julian McAuley. Sand: Boosting llm agents with self-taught action deliberation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3062--3077, 2025a

  66. [66]

    Knowledge-aware query expansion with large language models for textual and relational retrieval

    Yu Xia, Junda Wu, Sungchul Kim, Tong Yu, Ryan A Rossi, Haoliang Wang, and Julian McAuley. Knowledge-aware query expansion with large language models for textual and relational retrieval. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long...

  67. [67]

    A survey on personalized and pluralistic preference alignment in large language models

    Zhouhang Xie, Junda Wu, Yiran Shen, Raghav Jain, Yu Xia, Xintong Li, Aaron Chang, Ryan A Rossi, Tong Yu, Sachin Kumar, et al. A survey on personalized and pluralistic preference alignment in large language models. In Second Conference on Language Modeling

  68. [68]

    Neighborhood-based collaborative filtering for conversational recommendation

    Zhouhang Xie, Junda Wu, Hyunsik Jeon, Zhankui He, Harald Steck, Rahul Jha, Dawen Liang, Nathan Kallus, and Julian McAuley. Neighborhood-based collaborative filtering for conversational recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems, pp. 1045--1050, 2024

  69. [69]

    List items one by one: A new data source and learning paradigm for multimodal llms

    An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms. In First Conference on Language Modeling

  70. [70]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proce...

  71. [71]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  72. [72]

    Explainable chain-of-thought reasoning: An empirical analysis on state-aware reasoning dynamics

    Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, and Julian McAuley. Explainable chain-of-thought reasoning: An empirical analysis on state-aware reasoning dynamics. arXiv preprint arXiv:2509.00190, 2025b

  73. [73]

    Llamarec: Two-stage recommendation using large language models for ranking

    Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. Llamarec: Two-stage recommendation using large language models for ranking, 2023. URL https://arxiv.org/abs/2311.02089

  74. [74]

    Gvpo: Group variance policy optimization for large language model post-training

    Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, and Hui Xiong. Gvpo: Group variance policy optimization for large language model post-training, 2025. URL https://arxiv.org/abs/2504.19599

  75. [75]

    Rank-grpo: Training llm-based conversational recommender systems with reinforcement learning

    Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Vito Ostuni, Jundong Li, and Nathan Kallus. Rank-grpo: Training llm-based conversational recommender systems with reinforcement learning, 2026. URL https://arxiv.org/abs/2510.20150

  76. [76]

    Rank-r1: Enhancing reasoning in llm-based document rerankers via reinforcement learning

    Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. Rank-r1: Enhancing reasoning in llm-based document rerankers via reinforcement learning, 2025. URL https://arxiv.org/abs/2503.06034