Learning to Trust: Dynamic Utilization of Retrieval-Augmented Generation for E-commerce Search Relevance

Bo Zheng; Chenhe Dong; Dan Ou; Haihong Tang; Shaowei Yao; Tingqiao Xu; Yiming Jin; Zerui Huang

arxiv: 2510.11122 · v2 · submitted 2025-10-13 · 💻 cs.IR

Learning to Trust: Dynamic Utilization of Retrieval-Augmented Generation for E-commerce Search Relevance

Tingqiao Xu , Shaowei Yao , Chenhe Dong , Yiming Jin , Zerui Huang , Dan Ou , Haihong Tang , Bo Zheng This is my paper

Pith reviewed 2026-05-18 08:03 UTC · model grok-4.3

classification 💻 cs.IR

keywords e-commerce searchretrieval-augmented generationreinforcement learningcontext utilizationsearch relevancelarge language modelsnoisy retrievalGroup Relative Policy Optimization

0 comments

The pith

DyKnow-RAG trains LLMs via reinforcement learning to dynamically trust or ignore noisy retrieved context in one inference pass for e-commerce search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DyKnow-RAG as a reinforcement learning framework that teaches large language models when to use external retrieval context during query-item relevance estimation. It directly confronts three production constraints: inherently noisy context from retrieval, strict latency limits that forbid multi-stage refinement, and the requirement to judge both relevance and context trustworthiness in a single forward pass. The approach relies on Group Relative Policy Optimization with a dual-group rollout that compares parametric-only and context-augmented generations, plus a posterior-driven scaling of advantages between groups. This removes any need for human process labels while keeping inference overhead unchanged. Offline tests report clear Macro-F1 and Accuracy lifts on noise-sensitive query slices, and live A/B experiments in Taobao production show gains in GSB and Item Goodrate under a 400 ms p99 latency budget.

Core claim

DyKnow-RAG is a reinforcement learning framework built on Group Relative Policy Optimization that uses a dual-group rollout strategy (parametric-only versus with-context) and a posterior-driven inter-group advantage scaling mechanism to enable the model to optimize utilization of external knowledge without human process labels or extra inference overhead.

What carries the argument

Dual-group rollout strategy with posterior-driven inter-group advantage scaling inside a GRPO reinforcement learning loop

If this is right

Macro-F1 and Accuracy rise significantly on noise-sensitive query slices
Production A/B tests produce consistent lifts in GSB and Item Goodrate
The system maintains p99 latency under 400 ms while serving hundreds of millions of users and billions of daily requests
Structured Chain-of-Thought and an uncertainty-prioritized RL pool stabilize training

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-group comparison technique may transfer to other retrieval-heavy tasks where context value is hard to label directly
Similar trust-learning loops could reduce reliance on manual annotation when deploying RAG in fast-changing domains
The single-pass design suggests compatibility with further latency-reduction methods such as model distillation

Load-bearing premise

The posterior-driven inter-group advantage scaling mechanism accurately measures the value of retrieved context and does not introduce systematic bias from the dual-group rollout comparison itself.

What would settle it

Offline evaluation on a held-out set of noise-sensitive long-tail queries showing no Macro-F1 or Accuracy gain relative to a standard RAG baseline would falsify the claimed benefit of the dynamic trust training.

Figures

Figures reproduced from arXiv: 2510.11122 by Bo Zheng, Chenhe Dong, Dan Ou, Haihong Tang, Shaowei Yao, Tingqiao Xu, Yiming Jin, Zerui Huang.

**Figure 1.** Figure 1: DyKnow-RAG training and deployment overview. Stage 1: SFT with structured chain of thought and optional DPO [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Case study: an off-domain context chunk leads RAG [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Accurately estimating query-item relevance is vital for e-commerce ranking and conversion. While Large Language Models (LLMs) excel at reasoning, they often lack specialized knowledge required for long-tail or fast-evolving queries, necessitating Retrieval-Augmented Generation (RAG). However, production environments face three critical challenges: (1) external context is inherently noisy and inconsistent; (2) extreme latency budgets prohibit multi-stage processing or refinement; and (3) the model must simultaneously assess relevance and context-trust within a unified inference pass. We propose DyKnow-RAG, a reinforcement learning framework that teaches LLMs to learn to trust through dynamic utilization of external knowledge. Built on Group Relative Policy Optimization (GRPO), DyKnow-RAG utilizes a dual-group rollout strategy (parametric-only vs. with-context) and a posterior-driven inter-group advantage scaling mechanism. This enables the model to optimize context utilization without human process labels or extra inference overhead. Our pipeline further integrates structured Chain-of-Thought (CoT) and an uncertainty-prioritized RL pool to stabilize training.Offline evaluations show significant Macro-F1 and Accuracy gains, particularly on noise-sensitive query slices. Importantly, DyKnow-RAG has been deployed in Taobao's production system, serving hundreds of millions of active users and billions of daily search requests. Controlled A/B tests demonstrate consistent lifts in key business metrics, including GSB and Item Goodrate, while maintaining a p99 latency under 400ms. This work provides a scalable and deployable paradigm for operationalizing noisy RAG under extreme efficiency constraints of large-scale industrial search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DyKnow-RAG gives a workable RL recipe for letting an LLM decide when to trust noisy retrieval in live e-commerce search, with actual Taobao deployment behind it.

read the letter

The main takeaway is that this paper shows how to train an LLM to handle context trust dynamically inside one forward pass using Group Relative Policy Optimization. They run dual rollouts—one parametric only, one with retrieved context—then scale advantages from the posterior difference so the model learns to use external knowledge only when it helps, all without extra inference cost or human labels for the trust decision. That setup plus the uncertainty-prioritized pool and structured CoT is the concrete engineering move for production constraints where latency must stay under 400 ms and context is often noisy or inconsistent.

Referee Report

2 major / 1 minor

Summary. The paper proposes DyKnow-RAG, a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that employs a dual-group rollout strategy (parametric-only versus with-context) together with a posterior-driven inter-group advantage scaling mechanism. This allows an LLM to learn dynamic utilization of noisy external context for e-commerce query-item relevance estimation in a single inference pass, without human process labels or added latency. Offline results claim Macro-F1 and Accuracy gains especially on noise-sensitive slices; production A/B tests report lifts in GSB and Item Goodrate while keeping p99 latency under 400 ms, with deployment at Taobao scale.

Significance. If the central empirical claims hold after addressing the noted concerns, the work supplies a practical, low-overhead paradigm for integrating context-trust learning into production RAG systems under extreme latency and noise constraints typical of large-scale e-commerce search. The reported deployment and business-metric lifts constitute concrete evidence of operational viability for long-tail and fast-evolving queries.

major comments (2)

[Abstract (GRPO dual-group rollout and posterior-driven inter-group advantage scaling)] The dual-group rollout (parametric-only vs. with-context) plus posterior-driven inter-group advantage scaling is load-bearing for the claim that the model learns to 'trust' context. Because the two groups receive different input distributions, their output variances and posterior estimates can differ systematically, especially on noise-sensitive slices; this risks spurious advantage assignment to the with-context group even when context adds no genuine signal. Please add explicit analysis (e.g., variance comparison or ablation of the scaling step) showing that the mechanism isolates context value rather than rollout artifacts.
[Abstract (offline evaluations and production A/B tests)] Offline Macro-F1/Accuracy improvements and A/B lifts are reported without error bars, exact dataset sizes, number of runs, baseline details, or the precise definition and selection criteria for 'noise-sensitive query slices.' These omissions make it impossible to assess statistical reliability or reproducibility of the central empirical support.

minor comments (1)

[Abstract] The abstract states 'Offline evaluations show significant Macro-F1 and Accuracy gains' but does not quantify the deltas or point to a results table; adding these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of methodological rigor and reproducibility that we address below. We have revised the manuscript to incorporate additional analyses and details as suggested.

read point-by-point responses

Referee: [Abstract (GRPO dual-group rollout and posterior-driven inter-group advantage scaling)] The dual-group rollout (parametric-only vs. with-context) plus posterior-driven inter-group advantage scaling is load-bearing for the claim that the model learns to 'trust' context. Because the two groups receive different input distributions, their output variances and posterior estimates can differ systematically, especially on noise-sensitive slices; this risks spurious advantage assignment to the with-context group even when context adds no genuine signal. Please add explicit analysis (e.g., variance comparison or ablation of the scaling step) showing that the mechanism isolates context value rather than rollout artifacts.

Authors: We acknowledge the validity of this concern regarding potential systematic differences arising from distinct input distributions in the dual-group rollouts. To demonstrate that the posterior-driven inter-group advantage scaling isolates genuine context value, we have added a dedicated analysis subsection (Section 4.4) in the revised manuscript. This includes: (i) explicit variance comparisons of model outputs and posterior estimates between the parametric-only and with-context groups, stratified by noise-sensitive slices; (ii) an ablation that removes the inter-group scaling component while retaining dual-group rollouts, showing reduced gains and confirming the scaling's role in mitigating artifacts; and (iii) correlation analysis between assigned advantages and independent measures of context utility (e.g., retrieval precision). These additions substantiate that the mechanism captures context contribution beyond rollout-induced variance. revision: yes
Referee: [Abstract (offline evaluations and production A/B tests)] Offline Macro-F1/Accuracy improvements and A/B lifts are reported without error bars, exact dataset sizes, number of runs, baseline details, or the precise definition and selection criteria for 'noise-sensitive query slices.' These omissions make it impossible to assess statistical reliability or reproducibility of the central empirical support.

Authors: We agree that these omissions limit the ability to fully assess reliability and reproducibility. In the revised manuscript, we have expanded the experimental sections (5.1 and 5.2) to include: error bars derived from 5 independent runs with reported standard deviations and statistical significance tests; exact dataset sizes for offline evaluation (training set of approximately 12 million queries, test set of 800,000 queries); number of runs and training details; comprehensive baseline descriptions including model variants and hyperparameter settings; and a precise definition of noise-sensitive query slices as those with retrieval recall below 0.65 or involving items with high update frequency in the catalog. A new table summarizes the full experimental configuration for clarity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core derivation relies on standard Group Relative Policy Optimization (GRPO) applied to a dual-group rollout (parametric-only vs. with-context) whose advantage scaling is computed directly from the observed inter-group performance difference. This scaling step is not fitted to the downstream Macro-F1 or A/B metrics; it is an internal RL signal derived from the rollout comparison itself. Offline gains and production A/B lifts are reported as independent empirical outcomes rather than being algebraically entailed by the scaling definition. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing for the central claim. The method therefore remains non-circular by the paper's own equations and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of GRPO-based optimization and the assumption that inter-group comparisons provide unbiased signals for context value; no new physical entities or ad-hoc constants are introduced beyond standard RL hyperparameters.

axioms (1)

domain assumption Group Relative Policy Optimization produces stable policy updates when applied to dual-group rollouts comparing parametric and context-augmented generations.
Invoked in the description of the training framework that enables optimization without human labels.

pith-pipeline@v0.9.0 · 5840 in / 1328 out tokens · 30947 ms · 2026-05-18T08:03:24.279393+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

posterior-driven inter-group advantage scaling... β=4·σ(4·(acc_with − acc_without)) α=0.1/β

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 4 internal anchors

[1]

Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management39, 1 (2003), 45–65

work page 2003
[2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi

work page
[3]

Self-rag: Learning to retrieve, generate, and critique through self-reflection. (2024)

work page 2024
[4]

Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, and Wei Zhang. 2024. Towards Boosting LLMs-driven Relevance Model- ing with Progressive Retrieved Behavior-augmented Prompting.arXiv preprint arXiv:2408.09439(2024)

work page arXiv 2024
[5]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

work page 2019
[6]

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, and Haihong Tang. 2025. TaoSR1: The Thinking Model for E-commerce Relevance Search.arXiv preprint arXiv:2508.12365(2025)

work page arXiv 2025
[7]

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. InProceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338

work page 2013
[8]

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card.arXiv preprint arXiv:2412.16720(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park

work page
[10]

InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7029–7043

work page 2024
[11]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, and Zhonghou Lv. 2025. Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation.arXiv preprint arXiv:2506.05154(2025)

work page arXiv 2025
[13]

Yanming Liu, Xinyue Peng, Xuhong Zhang, Weihao Liu, Jianwei Yin, Jiannan Cao, and Tianyu Du. 2024. RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. InFindings of the Association for Computational Linguistics ACL 2024. 4730–4749

work page 2024
[14]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems36 (2023), 53728–53741

work page 2023
[15]

Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends®in Information Retrieval 3, 4 (2009), 333–389

work page 2009
[16]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

work page
[17]

Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[18]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. 2025. Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs.arXiv preprint arXiv:2505.11277(2025)

work page arXiv 2025
[20]

Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1- Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning.arXiv preprint arXiv:2505.17005(2025)

work page arXiv 2025
[21]

Tian Tang, Zhixing Tian, Zhenyu Zhu, Chenyang Wang, Haiqing Hu, Guoyu Tang, Lin Liu, and Sulong Xu. 2025. LREF: A Novel LLM-based Relevance Framework for E-commerce Search. InCompanion Proceedings of the ACM on Web Conference

work page 2025
[22]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

work page 2017
[23]

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan Ö Arık. 2024. As- tute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models.arXiv preprint arXiv:2410.07176(2024)

work page arXiv 2024
[24]

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. 2025. Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library.arXiv preprint arXiv:2506.06122(2025)

work page arXiv 2025
[25]

Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, and Haifeng Huang. 2025. Im- proving retrieval augmented language model with self-reasoning. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 25534–25542

work page 2025
[26]

Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. 2025. R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforce- ment Learning.arXiv preprint arXiv:2506.04185(2025). Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

work page arXiv 2025

[1] [1]

Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management39, 1 (2003), 45–65

work page 2003

[2] [2]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi

work page

[3] [3]

Self-rag: Learning to retrieve, generate, and critique through self-reflection. (2024)

work page 2024

[4] [4]

Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, and Wei Zhang. 2024. Towards Boosting LLMs-driven Relevance Model- ing with Progressive Retrieved Behavior-augmented Prompting.arXiv preprint arXiv:2408.09439(2024)

work page arXiv 2024

[5] [5]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186

work page 2019

[6] [6]

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, and Haihong Tang. 2025. TaoSR1: The Thinking Model for E-commerce Relevance Search.arXiv preprint arXiv:2508.12365(2025)

work page arXiv 2025

[7] [7]

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. InProceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338

work page 2013

[8] [8]

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card.arXiv preprint arXiv:2412.16720(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park

work page

[10] [10]

InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7029–7043

work page 2024

[11] [11]

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, and Zhonghou Lv. 2025. Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation.arXiv preprint arXiv:2506.05154(2025)

work page arXiv 2025

[13] [13]

Yanming Liu, Xinyue Peng, Xuhong Zhang, Weihao Liu, Jianwei Yin, Jiannan Cao, and Tianyu Du. 2024. RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. InFindings of the Association for Computational Linguistics ACL 2024. 4730–4749

work page 2024

[14] [14]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems36 (2023), 53728–53741

work page 2023

[15] [15]

Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends®in Information Retrieval 3, 4 (2009), 333–389

work page 2009

[16] [16]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

work page

[17] [17]

Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[18] [18]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang. 2025. Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs.arXiv preprint arXiv:2505.11277(2025)

work page arXiv 2025

[20] [20]

Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. 2025. R1- Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning.arXiv preprint arXiv:2505.17005(2025)

work page arXiv 2025

[21] [21]

Tian Tang, Zhixing Tian, Zhenyu Zhu, Chenyang Wang, Haiqing Hu, Guoyu Tang, Lin Liu, and Sulong Xu. 2025. LREF: A Novel LLM-based Relevance Framework for E-commerce Search. InCompanion Proceedings of the ACM on Web Conference

work page 2025

[22] [22]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

work page 2017

[23] [23]

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan Ö Arık. 2024. As- tute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models.arXiv preprint arXiv:2410.07176(2024)

work page arXiv 2024

[24] [24]

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. 2025. Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library.arXiv preprint arXiv:2506.06122(2025)

work page arXiv 2025

[25] [25]

Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, and Haifeng Huang. 2025. Im- proving retrieval augmented language model with self-reasoning. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 25534–25542

work page 2025

[26] [26]

Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, and Limin Liu. 2025. R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforce- ment Learning.arXiv preprint arXiv:2506.04185(2025). Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

work page arXiv 2025