pith. machine review for the scientific record.

arxiv: 2605.12995 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords policy optimization · candidate generation · ranking · large language models · reinforcement learning · sequential recommendation · multi-hop question answering

The pith

F-GRPO lets one LLM jointly generate candidates and rank them by factorizing policy optimization into separate phases with distinct advantages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the credit assignment problem that arises when an LLM must produce an ordered list of candidates in a single autoregressive pass yet receives only a final utility score. Standard group-relative policy optimization cannot tell whether poor results stem from missing relevant items or from ordering them badly. F-GRPO factorizes the policy into a generation phase and a ranking phase, applies an order-invariant coverage reward to the first and a position-aware utility reward to the second, and supplies each phase with its own group-relative advantage inside a shared two-phase sequence objective. The resulting unified training improves top-ranked performance over both GRPO and decoupled baselines while remaining competitive with strong zero-shot rerankers and requiring no architectural changes at inference time.
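
The abstract names the two reward signals but not their formulas. As a purely illustrative sketch, here is one plausible instantiation in Python, assuming set-recall coverage and DCG-style position weighting; the function names and exact formulas are editorial assumptions, not taken from the paper.

```python
import math

def coverage_reward(candidates, relevant):
    # Order-invariant: counts which relevant items made it into the slate,
    # regardless of where they appear.
    hits = set(candidates) & set(relevant)
    return len(hits) / max(len(relevant), 1)

def utility_reward(ranked, relevant):
    # Position-aware: DCG-style weighting, so a relevant item at rank 0
    # contributes weight 1.0 and later ranks contribute progressively less.
    rel = set(relevant)
    gain = sum(1.0 / math.log2(rank + 2)
               for rank, item in enumerate(ranked) if item in rel)
    ideal = sum(1.0 / math.log2(r + 2)
                for r in range(min(len(rel), len(ranked))))
    return gain / ideal if ideal > 0 else 0.0
```

Shuffling the ranked list leaves `coverage_reward` unchanged while `utility_reward` moves, which is exactly the separation the factorized objective exploits.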

Core claim

By factorizing the policy into candidate generation and ranking while sharing a single LLM backbone, and by applying separate group-relative advantages to each phase inside a two-phase sequence-level objective, the model can optimize both the selection of relevant candidates and their correct ordering against downstream utility signals in a single end-to-end rollout.

What carries the argument

Factorized Group-Relative Policy Optimization (F-GRPO), which decomposes the sequence-level objective into generation and ranking phases, supplies each with its own group-relative advantage, and combines an order-invariant coverage reward with a position-aware utility reward.
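
Group-relative advantages normalize each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of the phase-specific version described above, assuming the standard GRPO normalization (subtract the group mean, divide by the group standard deviation); the reward values are hypothetical:

```python
import torch

def group_relative_advantage(rewards, eps=1e-6):
    # GRPO-style normalization over a group of G rollouts of one prompt.
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Each of G = 4 rollouts is scored twice: an order-invariant coverage
# reward for the generation phase and a position-aware utility reward
# for the ranking phase.
coverage = [0.6, 0.8, 0.4, 0.8]
utility = [0.5, 0.3, 0.4, 0.7]

adv_gen = group_relative_advantage(coverage)   # credited to generation tokens
adv_rank = group_relative_advantage(utility)   # credited to ranking tokens
```

A rollout can be above the group average at assembling the slate yet below it at ordering, and the two advantages register this independently; a single sequence-level advantage cannot express that distinction.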

If this is right

  • Top-ranked performance rises over both standard GRPO and separately trained generation-then-ranking pipelines on sequential recommendation and multi-hop QA tasks.
  • The method outperforms supervised fine-tuning baselines while staying competitive with strong zero-shot rerankers.
  • No changes to model architecture or inference procedure are required after training.
  • End-to-end optimization aligns generation and ranking directly with final utility rather than with intermediate retrieval metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase-factorization pattern could apply to other composite generative tasks such as multi-step planning followed by execution.
  • Because the backbone remains unchanged, the approach may scale to larger models without doubling parameter count at deployment.
  • If the two-phase objective proves stable, similar factorization might reduce the need for hand-crafted multi-stage pipelines in retrieval-augmented generation.

Load-bearing premise

The credit assignment problem between missing good candidates and mis-ordering them can be solved simply by giving each phase its own group-relative advantage while keeping the LLM backbone shared.

What would settle it

A controlled experiment in which the unified model is forced to generate the same high-coverage candidate set as a strong decoupled baseline yet still produces lower final utility after ranking, or the reverse case where ranking quality stays fixed but generation quality drops.

Figures

Figures reproduced from arXiv: 2605.12995 by Bowen Jin, Gagan Mundada, Jiawei Han, Jingbo Shang, Julian McAuley, Junda Wu, Ritwik Sinha, Rohan Surana, Sizhe Zhou, Tong Yu, Xintong Li, Yizhu Jiao.

Figure 1: (a) Black-box ranking conflates candidate selection and ordering, yielding ambiguous credit assignment. (b) Factorized in-context generation and ranking with phase-specific goals within a single autoregressive rollout. We make the list-to-rank decision explicit within a single LLM rollout through in-context exploration. The model first constructs a candidate slate and then ranks that slate within the sa… view at source ↗
Figure 2: Training dynamics on LastFM. (a) Slate reward ablation. (b) The slate generator… view at source ↗
Figure 3: Precision–recall redistribution between the slate and ranker on LastFM across two… view at source ↗
Figure 4: Hyperparameter sensitivity for F-GRPO. (a)… view at source ↗
Figure 5: Evaluation metrics during training for F-GRPO and GRPO on Qwen3-4B. F-GRPO… view at source ↗
read the original abstract

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces F-GRPO, a factorized extension of group-relative policy optimization that unifies candidate generation and ranking inside a single autoregressive LLM rollout. It factorizes the policy into generation and ranking components that share one backbone, optimizes them jointly with an order-invariant coverage reward and a position-aware utility reward, and resolves the resulting credit-assignment problem by applying separate group-relative advantages inside a two-phase sequence-level objective. Experiments on sequential recommendation and multi-hop QA benchmarks report that F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers without any inference-time architectural changes.

Significance. If the factorization and phase-specific advantages can be shown to remain unbiased under the shared autoregressive coupling, the method would offer a practical route to end-to-end optimization of generative ranking pipelines that currently rely on staged retrieval-plus-reranking. The absence of inference-time overhead and the reported gains over both GRPO and supervised baselines would make the framework relevant to recommendation and retrieval-augmented generation systems.

major comments (2)
  1. [Method / two-phase objective] The central technical claim—that separate group-relative advantages for the generation and ranking phases remain unbiased inside a single contiguous autoregressive sequence—requires an explicit derivation or proof. The abstract states that the two-phase objective addresses the credit-assignment gap, yet no argument is supplied showing that gradients from the position-aware utility reward do not leak back through the shared backbone into generation decisions when the phase boundary is realized only by formatting conventions or token masking.
  2. [Experiments] The experimental section must include an ablation that isolates the effect of the factorization itself (i.e., F-GRPO versus a non-factorized GRPO baseline that uses the same two-phase formatting). Without this ablation, it is impossible to attribute the reported gains to the proposed advantage separation rather than to the joint training or the choice of rewards.
minor comments (2)
  1. [Experiments] The abstract refers to “sequential recommendation and multi-hop question answering benchmarks” without naming the concrete datasets or reporting the number of runs and statistical significance tests; these details should be added to the experimental section.
  2. [Method] Notation for the generation-phase and ranking-phase advantages should be introduced with explicit equations rather than descriptive text, to make the two-phase objective reproducible.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We appreciate the recognition of F-GRPO's potential for end-to-end optimization of generative ranking pipelines without inference-time overhead. We address each major comment below and will revise the manuscript to strengthen the technical presentation and experimental validation.

read point-by-point responses
  1. Referee: [Method / two-phase objective] The central technical claim—that separate group-relative advantages for the generation and ranking phases remain unbiased inside a single contiguous autoregressive sequence—requires an explicit derivation or proof. The abstract states that the two-phase objective addresses the credit-assignment gap, yet no argument is supplied showing that gradients from the position-aware utility reward do not leak back through the shared backbone into generation decisions when the phase boundary is realized only by formatting conventions or token masking.

    Authors: We acknowledge that the manuscript would benefit from an explicit derivation showing that the phase-specific group-relative advantages remain unbiased under the shared autoregressive backbone. In the revised version we will add a dedicated subsection deriving the policy gradient for the two-phase objective. The derivation will demonstrate that, by computing separate advantages over masked phase-specific tokens and applying the group-relative baseline within each phase, gradients from the position-aware utility reward are isolated to the ranking tokens and do not propagate back to generation decisions. revision: yes
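
To make the proposal concrete, here is an editorial sketch of what "separate advantages over masked phase-specific tokens" could look like as a policy-gradient surrogate; this is a reconstruction from the rebuttal's wording, not the authors' code, and the mask names are hypothetical:

```python
import torch

def two_phase_pg_loss(logprobs, gen_mask, rank_mask, adv_gen, adv_rank):
    # REINFORCE-style surrogate where each phase's advantage weights only
    # that phase's tokens.
    #   logprobs:  (B, T) per-token log-probabilities of sampled rollouts
    #   gen_mask:  (B, T) 1.0 on candidate-generation tokens, else 0.0
    #   rank_mask: (B, T) 1.0 on ranking tokens, else 0.0 (disjoint masks)
    #   adv_gen, adv_rank: (B,) phase-specific group-relative advantages
    weights = gen_mask * adv_gen[:, None] + rank_mask * adv_rank[:, None]
    n_tokens = (gen_mask + rank_mask).sum().clamp(min=1.0)
    return -(weights * logprobs).sum() / n_tokens
```

The masking keeps the utility advantage off the generation tokens in the loss itself, but updates driven by ranking tokens still move the shared backbone weights that generation uses; that indirect coupling is what the requested derivation would have to bound.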

  2. Referee: [Experiments] The experimental section must include an ablation that isolates the effect of the factorization itself (i.e., F-GRPO versus a non-factorized GRPO baseline that uses the same two-phase formatting). Without this ablation, it is impossible to attribute the reported gains to the proposed advantage separation rather than to the joint training or the choice of rewards.

    Authors: We agree that an ablation isolating the factorization and advantage separation is necessary. We will add this experiment in the revised manuscript: a non-factorized GRPO baseline that uses identical two-phase formatting and the same order-invariant coverage plus position-aware utility rewards, but applies a single group-relative advantage over the entire sequence. Performance differences versus F-GRPO will be reported on both the sequential recommendation and multi-hop QA benchmarks to attribute gains specifically to the phase-specific advantages. revision: yes
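
For concreteness, the promised baseline differs from F-GRPO in a single step: both rewards collapse into one scalar that is normalized once and applied uniformly to every token. A minimal self-contained sketch of the contrast, with illustrative numbers:

```python
import torch

def group_norm(rewards, eps=1e-6):
    # Group-relative normalization, as in GRPO.
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

coverage = [0.6, 0.8, 0.4, 0.8]   # hypothetical per-rollout rewards
utility = [0.5, 0.3, 0.4, 0.7]

# Non-factorized baseline: one advantage per rollout, applied to all tokens.
adv_all = group_norm([c + u for c, u in zip(coverage, utility)])

# F-GRPO: two advantages per rollout, each masked to its own phase's tokens.
adv_gen, adv_rank = group_norm(coverage), group_norm(utility)
```

Since both conditions share the formatting and the reward definitions, any performance gap isolates the effect of the advantage separation.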

Circularity Check

0 steps flagged

No significant circularity; F-GRPO defined as new factorized objective from credit-assignment setup

full rationale

The paper introduces F-GRPO as an explicitly new optimization framework that factorizes a single autoregressive policy into generation and ranking phases, using separate group-relative advantages inside a two-phase sequence-level objective together with order-invariant coverage and position-aware utility rewards. The construction is motivated directly by the credit-assignment gap stated in the abstract, and nothing in the abstract reduces the claimed separation or the performance gains to fitted inputs or prior results by definition. The central claim therefore stands as an independent modeling choice rather than a renaming or a self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach rests on standard RL concepts (group-relative advantages, sequence-level rewards) and the assumption that factorization resolves credit assignment.

pith-pipeline@v0.9.0 · 5639 in / 1064 out tokens · 52402 ms · 2026-05-14T19:48:31.789661+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 6 internal anchors

  1. [1]

    The million song dataset

    Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011

  2. [2]

    Autoregressive search engines: Generating substrings as document identifiers

    Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni. Autoregressive search engines: Generating substrings as document identifiers. Advances in Neural Information Processing Systems, 35:31668--31683, 2022

  3. [3]

    Generative slate recommendation with reinforcement learning

    Romain Deffayet, Thibaut Thonet, Jean-Michel Renders, and Maarten de Rijke. Generative slate recommendation with reinforcement learning. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM '23, pp. 580–588, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394079. doi:10.1145/3539597.35...

  4. [4]

    Policy-gradient training of language models for ranking

    Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, and Thorsten Joachims. Policy-gradient training of language models for ranking, 2024. URL https://arxiv.org/abs/2310.04407

  5. [5]

    Re2G: Retrieve, rerank, generate

    Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. Re2G: Retrieve, rerank, generate. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...

  6. [6]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633--638, 2025

  7. [7]

    Towards two-stage counterfactual learning to rank

    Shashank Gupta, Yiming Liao, and Maarten de Rijke. Towards two-stage counterfactual learning to rank. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), ICTIR '25, pp. 177–182, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400718618. doi:10.1145/3731...

  8. [8]

    The MovieLens datasets: History and context

    F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), December 2015. ISSN 2160-6455. doi:10.1145/2827872. URL https://doi.org/10.1145/2827872

  9. [9]

    Session-based Recommendations with Recurrent Neural Networks

    Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks, 2016. URL https://arxiv.org/abs/1511.06939

  10. [10]

    Towards universal sequence representation learning for recommender systems

    Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, pp. 585–593, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393850. doi:10...

  11. [11]

    Large language models are zero-shot rankers for recommender systems

    Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. Large language models are zero-shot rankers for recommender systems. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part II, pp. 364–381, Berlin, Heidelberg, 202...

  12. [12]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025a. URL https://openreview.net/forum?id=NFM8F5cV0V

  13. [13]

    Interactive visualization recommendation with hier-sucb

    Songwen Hu, Ryan A Rossi, Tong Yu, Junda Wu, Handong Zhao, Sungchul Kim, and Shuai Li. Interactive visualization recommendation with hier-sucb. In Proceedings of the ACM on Web Conference 2025, pp. 313--321, 2025b

  14. [14]

    A survey of foundation model-powered recommender systems: From feature-based, generative to agentic paradigms

    Chengkai Huang, Hongtao Huang, Tong Yu, Kaige Xie, Junda Wu, Shuai Zhang, Julian McAuley, Dietmar Jannach, and Lina Yao. A survey of foundation model-powered recommender systems: From feature-based, generative to agentic paradigms. arXiv preprint arXiv:2504.16420, 2025a

  15. [15]

    Towards agentic recommender systems in the era of multimodal large language models

    Chengkai Huang, Junda Wu, Yu Xia, Zixu Yu, Ruhan Wang, Tong Yu, Ruiyi Zhang, Ryan A Rossi, Branislav Kveton, Dongruo Zhou, et al. Towards agentic recommender systems in the era of multimodal large language models. arXiv preprint arXiv:2503.16734, 2025b

  16. [16]

    Pluralistic off-policy evaluation and alignment

    Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, and Lina Yao. Pluralistic off-policy evaluation and alignment. arXiv preprint arXiv:2509.19333, 2025c

  17. [17]

    Listwise preference diffusion optimization for user behavior trajectories prediction

    Hongtao Huang, Chengkai Huang, Junda Wu, Tong Yu, Julian McAuley, and Lina Yao. Listwise preference diffusion optimization for user behavior trajectories prediction. Advances in Neural Information Processing Systems, 38:159383--159408, 2026a

  18. [18]

    Image difference captioning via adversarial preference optimization

    Zihan Huang, Junda Wu, Rohan Surana, Tong Yu, David Arbour, Ritwik Sinha, and Julian McAuley. Image difference captioning via adversarial preference optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33746--33758, 2025d

  19. [19]

    Evaluation on entity matching in recommender systems

    Zihan Huang, Rohan Surana, Zhouhang Xie, Junda Wu, Yu Xia, and Julian McAuley. Evaluation on entity matching in recommender systems. arXiv preprint arXiv:2601.17218, 2026b

  20. [20]

    Active learning for direct preference optimization

    Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, and Tong Yu. Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076, 2025

  21. [21]

    A personalized conversational benchmark: Towards simulating personalized conversations

    Li Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, et al. A personalized conversational benchmark: Towards simulating personalized conversations. arXiv preprint arXiv:2505.14106, 2025a

  22. [22]

    Importance sampling for multi-negative multimodal direct preference optimization

    Xintong Li, Chuhan Wang, Junda Wu, Rohan Surana, Tong Yu, Julian McAuley, and Jingbo Shang. Importance sampling for multi-negative multimodal direct preference optimization. arXiv preprint arXiv:2509.25717, 2025b

  23. [23]

    Ract: Ranking-aware chain-of-thought optimization for llms

    Haowei Liu, Xuyang Wu, Guohao Sun, Hsin-Tai Wu, Zhiqiang Tao, and Yi Fang. Ract: Ranking-aware chain-of-thought optimization for llms. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP 2025, pp. 178–188, New York, NY, USA, 2025a. Association for...

  24. [24]

    Learning to rank for information retrieval

    Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225--331, 2009

  25. [25]

    Understanding r1-zero-like training: A critical perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025b. URL https://openreview.net/forum?id=5PAF7PAY2Y

  26. [26]

    Recranker: Instruction tuning large language model as ranker for top-k recommendation

    Sichun Luo, Bowei He, Haohan Zhao, Wei Shao, Yanlin Qi, Yinya Huang, Aojun Zhou, Yuxuan Yao, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. Recranker: Instruction tuning large language model as ranker for top-k recommendation. ACM Trans. Inf. Syst., 43(5), July 2025. ISSN 1046-8188. doi:10.1145/3705728. URL https://doi.org/10.1145/3705728

  27. [27]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention, 2025. URL https://arxiv.org/abs/2506.13585

  28. [28]

    Ws-grpo: Weakly-supervised group-relative policy optimization for rollout-efficient reasoning

    Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, et al. Ws-grpo: Weakly-supervised group-relative policy optimization for rollout-efficient reasoning. arXiv preprint arXiv:2602.17025, 2026

  29. [29]

    Large language models for conversational user simulation: A comprehensive survey

    Bo Ni, Leyao Wang, Yu Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Leura, Samyadeep Basu, Subhojyoti Mukherjee, et al. Large language models for conversational user simulation: A comprehensive survey. 2025

  30. [30]

    A survey on llm-based conversational user simulation

    Bo Ni, Yu Wang, Leyao Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Luera, Samyadeep Basu, Subhojyoti Mukherjee, et al. A survey on llm-based conversational user simulation. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4266--4301, 2026

  31. [31]

    Document ranking with a pretrained sequence-to-sequence model

    Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. Document ranking with a pretrained sequence-to-sequence model. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 708--718, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-em...

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744, 2022

  33. [33]

    Higr: Efficient generative slate recommendation via hierarchical planning and multi-objective preference alignment

    Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Hongyong Yu, Chengxiang Zhuo, and Zang Li. Higr: Efficient generative slate recommendation via hierarchical planning and multi-objective preference alignment, 2026. URL https://arxiv.org/abs/2512.24787

  34. [34]

    The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models

    Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, 2021. URL https://arxiv.org/abs/2101.05667

  35. [35]

    RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze!, 2023. URL https://arxiv.org/abs/2312.02724

  36. [36]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  37. [37]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc

  38. [38]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, April 2009. ISSN 1554-0669. doi:10.1561/1500000019. URL https://doi.org/10.1561/1500000019

  39. [39]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

  40. [40]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  41. [41]

    Rankllm: A python package for reranking with llms

    Sahel Sharifymoghaddam, Ronak Pradeep, Andre Slavescu, Ryan Nguyen, Andrew Xu, Zijian Chen, Yilin Zhang, Yidi Chen, Jasper Xian, and Jimmy Lin. Rankllm: A python package for reranking with llms. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '25, pp. 3681–3690, New York, NY, USA, ...

  42. [42]

    Is ChatGPT good at search? Investigating large language models as re-ranking agents

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14918--14937, Si...

  43. [43]

    From reviews to dialogues: Active synthesis for zero-shot llm-based conversational recommender system

    Rohan Surana, Junda Wu, Zhouhang Xie, Yu Xia, Harald Steck, Dawen Liang, Nathan Kallus, and Julian McAuley. From reviews to dialogues: Active synthesis for zero-shot llm-based conversational recommender system. arXiv preprint arXiv:2504.15476, 2025

  44. [44]

    Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, and Julian McAuley. Generate, filter, control, replay: A comprehensive survey of rollout s...

  45. [45]

    Maximum likelihood reinforcement learning

    Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning, 2026. URL https://arxiv.org/abs/2602.02710

  46. [46]

    Scaling down, litting up: Efficient zero-shot listwise reranking with seq2seq encoder-decoder models

    Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. Scaling down, litting up: Efficient zero-shot listwise reranking with seq2seq encoder-decoder models, 2023. URL https://arxiv.org/abs/2312.16098

  47. [47]

    Listwise generative retrieval models via a sequential learning process

    Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, and Xueqi Cheng. Listwise generative retrieval models via a sequential learning process. ACM Trans. Inf. Syst., 42(5), April 2024. ISSN 1046-8188. doi:10.1145/3653712. URL https://doi.org/10.1145/3653712

  48. [48]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  49. [49]

    MuSiQue: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539--554, 2022. doi:10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/

  50. [50]

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. ...

  51. [51]

    Scenealign: Aligning multimodal reasoning to scene graphs in complex visual scenes

    Chuhan Wang, Xintong Li, Jennifer Yuntong Zhang, Junda Wu, Chengkai Huang, Lina Yao, Julian McAuley, and Jingbo Shang. Scenealign: Aligning multimodal reasoning to scene graphs in complex visual scenes. arXiv preprint arXiv:2601.05600, 2026

  52. [52]

    Zero-shot next-item recommendation using large pretrained language models

    Lei Wang and Ee-Peng Lim. Zero-shot next-item recommendation using large pretrained language models, 2023. URL https://arxiv.org/abs/2304.03153

  53. [53]

    A neural corpus indexer for document retrieval

    Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems, 35:25600--25614, 2022

  54. [54]

    Ctrls: Chain-of-thought reasoning via latent state-transition

    Junda Wu, Yuxin Xiong, Xintong Li, Sheldon Yu, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, and Julian McAuley. Ctrls: Chain-of-thought reasoning via latent state-transition. In The 29th International Conference on Artificial Intelligence and Statistics

  55. [55]

    Deconfounded and explainable interactive vision-language retrieval of complex scenes

    Junda Wu, Tong Yu, and Shuai Li. Deconfounded and explainable interactive vision-language retrieval of complex scenes. MM '21, pp. 2103–2111, New York, NY, USA, 2021a. Association for Computing Machinery. ISBN 9781450386517. doi:10.1145/3474085.3475366. URL https://doi.org/10.1145/3474085.3475366

  56. [56]

    Clustering of conversational bandits for user preference learning and elicitation

    Junda Wu, Canzhe Zhao, Tong Yu, Jingyang Li, and Shuai Li. Clustering of conversational bandits for user preference learning and elicitation. CIKM '21, pp. 2129–2139, New York, NY, USA, 2021b. Association for Computing Machinery. ISBN 9781450384469. doi:10.1145/3459637.3482328. URL https://doi.org/10.1145/3459637.3482328

  57. [57]

    Dynamics-aware adaptation for reinforcement learning based cross-domain interactive recommendation

    Junda Wu, Zhihui Xie, Tong Yu, Handong Zhao, Ruiyi Zhang, and Shuai Li. Dynamics-aware adaptation for reinforcement learning based cross-domain interactive recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pp. 290–300, New York, NY, USA, 2022. Association for Com...

  58. [58]

    Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation

    Junda Wu, Cheng-Chun Chang, Tong Yu, Zhankui He, Jianing Wang, Yupeng Hou, and Julian McAuley. Coral: Collaborative retrieval-augmented large language models improve long-tail recommendation. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 3391--3401, 2024a

  59. [59]

    Personalized multimodal large language models: A survey

    Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A Rossi, Franck Dernoncourt, et al. Personalized multimodal large language models: A survey. arXiv preprint arXiv:2412.02142, 2024b

  60. [60]

    Decot: Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention

    Junda Wu, Tong Yu, Xiang Chen, Haoliang Wang, Ryan Rossi, Sungchul Kim, Anup Rao, and Julian McAuley. Decot: Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14073--14087, 2024c

  61. [61]

    Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation

    Junda Wu, Warren Li, Zachary Novack, Amit Namburi, Carol Chen, and Julian McAuley. Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1--5. IEEE, 2025a

  62. [62]

    Ocean: Offline chain-of-thought evaluation and alignment in large language models

    Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, et al. Ocean: Offline chain-of-thought evaluation and alignment in large language models. In International Conference on Learning Representations, volume 2025, pp. 100570--100589, 2025b

  63. [63]

    In-context ranking preference optimization

    Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A. Rossi, Prithviraj Ammanabrolu, and Julian McAuley. In-context ranking preference optimization. In Second Conference on Language Modeling, 2025c. URL https://openreview.net/forum?id=L2NPhLAKEd

  64. [64]

    Doc-react: Multi-page heterogeneous document question-answering

    Junda Wu, Yu Xia, Tong Yu, Xiang Chen, Sai Sree Harsha, Akash V Maharaj, Ruiyi Zhang, Victor Bursztyn, Sungchul Kim, Ryan A Rossi, et al. Doc-react: Multi-page heterogeneous document question-answering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 67--78, 2025d

  65. [65]

    Sand: Boosting llm agents with self-taught action deliberation

    Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A Rossi, Lina Yao, and Julian McAuley. Sand: Boosting llm agents with self-taught action deliberation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3062--3077, 2025a

  66. [66]

    Knowledge-aware query expansion with large language models for textual and relational retrieval

    Yu Xia, Junda Wu, Sungchul Kim, Tong Yu, Ryan A Rossi, Haoliang Wang, and Julian McAuley. Knowledge-aware query expansion with large language models for textual and relational retrieval. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long...

  67. [67]

    A survey on personalized and pluralistic preference alignment in large language models

    Zhouhang Xie, Junda Wu, Yiran Shen, Raghav Jain, Yu Xia, Xintong Li, Aaron Chang, Ryan A Rossi, Tong Yu, Sachin Kumar, et al. A survey on personalized and pluralistic preference alignment in large language models. In Second Conference on Language Modeling

  68. [68]

    Neighborhood-based collaborative filtering for conversational recommendation

    Zhouhang Xie, Junda Wu, Hyunsik Jeon, Zhankui He, Harald Steck, Rahul Jha, Dawen Liang, Nathan Kallus, and Julian McAuley. Neighborhood-based collaborative filtering for conversational recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems, pp. 1045--1050, 2024

  69. [69]

    List items one by one: A new data source and learning paradigm for multimodal llms

    An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms. In First Conference on Language Modeling

  70. [70]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Proce...

  71. [71]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  72. [72]

    Explainable chain-of-thought reasoning: An empirical analysis on state-aware reasoning dynamics

    Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, and Julian McAuley. Explainable chain-of-thought reasoning: An empirical analysis on state-aware reasoning dynamics. arXiv preprint arXiv:2509.00190, 2025b

  73. [73]

    Llamarec: Two-stage recommendation using large language models for ranking

    Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. Llamarec: Two-stage recommendation using large language models for ranking, 2023. URL https://arxiv.org/abs/2311.02089

  74. [74]

    Gvpo: Group variance policy optimization for large language model post-training

    Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, and Hui Xiong. Gvpo: Group variance policy optimization for large language model post-training, 2025. URL https://arxiv.org/abs/2504.19599

  75. [75]

    Rank-grpo: Training llm-based conversational recommender systems with reinforcement learning

    Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Vito Ostuni, Jundong Li, and Nathan Kallus. Rank-grpo: Training llm-based conversational recommender systems with reinforcement learning, 2026. URL https://arxiv.org/abs/2510.20150

  76. [76]

    Rank-r1: Enhancing reasoning in llm-based document rerankers via reinforcement learning

    Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. Rank-r1: Enhancing reasoning in llm-based document rerankers via reinforcement learning, 2025. URL https://arxiv.org/abs/2503.06034