RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

Hamed Zamani; Hansi Zeng; Hong Yu; Jiatan Huang; Mingchen Li; Zhuo Qian

arxiv: 2605.26352 · v1 · pith:53SCSEY3new · submitted 2026-05-25 · 💻 cs.CL

Mingchen Li, Hansi Zeng, Zhuo Qian, Jiatan Huang, Hamed Zamani, Hong Yu

Pith reviewed 2026-06-29 21:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords: credit assignment, policy optimization, retrieval agents, reasoning agents, counterfactual evaluation, interactive retrieval, language agents, retrieval metrics

The pith

RICE-PO turns retrieval interactions into localized credit signals that train reasoning agents without a critic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the credit-assignment problem in training language agents that retrieve evidence iteratively. Executable actions such as queries can be scored directly by the retriever, but the latent reasoning steps that precede them cannot. RICE-PO therefore selects high-uncertainty actions as anchors, generates local counterfactual branches, and uses retrieval metrics to decide whether preceding reasoning deserves credit. The paper reports that agents trained this way outperform prompt-based agents and group-based reinforcement-learning baselines on BRIGHT and BEIR under the same retriever setting.

Core claim

RICE-PO is a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. It selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects remain stable. On BRIGHT and BEIR this produces consistent gains over prompt-based agents and group-based RL baselines under identical retriever settings.
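The gating rule in the core claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Branch` structure, the function `reasoning_credit`, the thresholds `influence_min` and `residual_max`, and the concrete definitions of "influence" and "residual" are all hypothetical choices made here for readability.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Branch:
    """One counterfactual branch sampled at an anchor (hypothetical structure)."""
    action_score: float  # retrieval metric of the executable action in this branch
    final_score: float   # retrieval metric at the end of this branch's rollout


def reasoning_credit(branches, chosen, influence_min=0.05, residual_max=0.10):
    """Return the credit passed back to the reasoning step that preceded `chosen`.

    Assumed definitions (illustrative, not taken from the paper):
      influence = chosen action score minus the mean action score over branches
      residual  = spread of (final_score - action_score) across branches
    Credit is propagated only when influence is strong and residuals are stable.
    """
    baseline = mean([b.action_score for b in branches])
    influence = chosen.action_score - baseline
    residual = pstdev([b.final_score - b.action_score for b in branches])
    if abs(influence) >= influence_min and residual <= residual_max:
        return influence
    return 0.0  # weak influence or unstable future effects: no credit
```

With these assumptions, a reasoning step receives a nonzero signal only when the action it led to clearly differs from its local alternatives and the downstream effect of the branches is roughly uniform.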

What carries the argument

RICE-PO, a critic-free policy optimization method that anchors credit assignment on high-uncertainty executable actions and measures reasoning influence via local counterfactual retrieval evaluations.

If this is right

Localized retrieval metrics can replace outcome-level rewards for training multi-step retrieval agents.
Credit assignment becomes possible without training or maintaining a separate critic model.
The same retriever can supply both evidence and training signals during agent optimization.
Reasoning steps receive credit only when they demonstrably alter subsequent retrieval outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same anchoring technique could be tested in other agent settings where some actions are directly scored by an environment.
Longer interaction traces might require additional checks on residual stability beyond what the current method assumes.
Interaction logs alone might reduce reliance on external reward models for reasoning agents.

Load-bearing premise

High-uncertainty executable actions serve as reliable anchors whose local counterfactual branches can measure the influence of prior reasoning while future residual effects stay stable.

What would settle it

Replace the uncertainty-based anchor selection with random actions and measure whether performance gains on BRIGHT and BEIR disappear or reverse.
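The control experiment described above needs two interchangeable anchor selectors. The following is a rough sketch, assuming each executable action is available as a list of per-token probability distributions and that uncertainty is measured as mean token entropy (the Figure 2 caption mentions entropy-triggered anchors, but the exact measure is not given here). The names `mean_token_entropy` and `select_anchors` are hypothetical.

```python
import math
import random


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over the token distributions of one action."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def select_anchors(actions, k, mode="entropy", seed=0):
    """Pick k anchor indices from `actions` (one list of token distributions each).

    mode="entropy": the k actions with the highest mean token entropy.
    mode="random":  k actions drawn uniformly at random (the control condition).
    """
    indices = list(range(len(actions)))
    if mode == "random":
        return sorted(random.Random(seed).sample(indices, k))
    ranked = sorted(indices, key=lambda i: mean_token_entropy(actions[i]), reverse=True)
    return sorted(ranked[:k])
```

Holding `k` fixed across both modes keeps the number of counterfactual branches equal, so any difference in downstream scores can be attributed to where the anchors are placed rather than to how many there are.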

Figures

Figures reproduced from arXiv: 2605.26352 by Hamed Zamani, Hansi Zeng, Hong Yu, Jiatan Huang, Mingchen Li, Zhuo Qian.

**Figure 2.** Ablation of reasoning credit assignment variants. Ablation of Credit Propagation Mechanisms: We conduct an ablation study on BRIGHT with DeepSeek-R1-Distill-Qwen-1.5B to analyze different reasoning-credit assignment strategies. For a fair comparison, all variants use the same entropy-triggered anchor points. Case 1 assigns the reasoning advantage from the paired step-level summary advantage. Case 2 assign…

**Figure 3.** Model performance with effect-only and influence-only credit propagation. However, it ignores whether this change ultimately affects the final retrieval reward. In contrast, the Effect Only variant considers only the impact of the reasoning action on the final reward, without explicitly measuring its influence on the intermediate summary.

**Figure 4.** CDF of task-level gains from entropy-based triggering over random triggering.

**Figure 5.** Prompt template for the multi-step retrieval agent.

**Figure 6.** Reward dynamics comparison on BRIGHT and BEIR.
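For readers who want to reproduce a plot in the style of Figure 4, the following sketch computes an empirical CDF of per-task gains. It assumes two aligned lists of per-task scores, one for entropy-based triggering and one for random triggering. The function names and input layout are hypothetical and not taken from the paper.

```python
def gain_cdf(entropy_scores, random_scores):
    """Empirical CDF points of per-task gains (entropy-triggered minus random).

    Returns sorted (gain, cumulative fraction) pairs. With tied gains, the last
    pair for a given value carries the CDF at that value.
    """
    if len(entropy_scores) != len(random_scores):
        raise ValueError("score lists must be aligned per task")
    gains = sorted(e - r for e, r in zip(entropy_scores, random_scores))
    n = len(gains)
    return [(g, (i + 1) / n) for i, g in enumerate(gains)]


def fraction_improved(entropy_scores, random_scores):
    """Share of tasks where entropy-based triggering scores strictly higher."""
    wins = sum(e > r for e, r in zip(entropy_scores, random_scores))
    return wins / len(entropy_scores)
```

The CDF value at a gain of zero gives the share of tasks where entropy-based triggering did not help, which is the quantity such a plot makes visible at a glance.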

Original abstract

Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.
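To make "evaluates local counterfactual branches using retrieval metrics" concrete, here is a small sketch that scores several branch rankings with nDCG@k and centres the scores on their mean. The abstract does not name the metric. nDCG@10 is commonly reported on both BRIGHT and BEIR and is used here purely as an assumption, with binary relevance and the hypothetical helper names `ndcg_at_k` and `branch_advantages`.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def branch_advantages(branch_rankings, relevant_ids, k=10):
    """Score each counterfactual branch and subtract the mean score of the group."""
    scores = [ndcg_at_k(ranking, relevant_ids, k) for ranking in branch_rankings]
    baseline = sum(scores) / len(scores)
    return [s - baseline for s in scores]
```

Because the baseline is the mean over sibling branches at the same anchor, no learned value function is needed, which is one way to read the "critic-free" description.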

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RICE-PO gives a practical way to assign credit to latent reasoning steps in retrieval agents by anchoring on high-uncertainty actions and local counterfactuals, but the outperformance claims rest on assumptions about stable residuals that need direct checks in the experiments.

RICE-PO turns the interaction structure itself into supervision for credit assignment. It identifies high-uncertainty executable actions as anchors, builds local counterfactual branches scored by retrieval metrics, and passes credit to the preceding reasoning only when the influence link is strong and future residuals look stable. That framing directly targets the asymmetry the abstract describes: actions are observable to the retriever while reasoning is not.

The paper does a clean job naming the problem and offering a critic-free alternative to outcome-level RL or prompt tuning. The reported gains on BRIGHT and BEIR against both prompt baselines and group-based RL, all under the same retriever, suggest the localized signals can matter in practice.

The soft spots sit in the empirical grounding. The abstract states consistent outperformance, yet the strength of that result depends on how "strong influence" and "stable residuals" are measured and whether ablations isolate the contribution of the propagation rule versus other design choices. The assumption that local branches remain informative without large future interference is plausible but not automatic in longer or noisier sessions; if the full experiments do not test that directly, the gains could shrink.

This is for people already working on interactive RAG agents and policy optimization for language models. A reader who needs a concrete handle on credit assignment in retrieval loops will find the construction useful even before the numbers are replicated.

Send it to peer review. The core construction is coherent and the problem is real; referees can verify the implementation details and controls.

Referee Report

0 major / 1 minor

Summary. The paper proposes RICE-PO, a critic-free policy optimization framework for training reasoning-based retrieval agents. It addresses credit assignment asymmetry by selecting high-uncertainty executable actions as anchors, evaluating local counterfactual branches via retrieval metrics, and propagating credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. Empirical results claim consistent outperformance over prompt-based agents and group-based RL baselines on BRIGHT and BEIR under identical retriever settings.

Significance. If the results hold, the work shows that interaction structure itself can yield localized supervision signals without external critics, which could improve training efficiency for interactive retrieval agents. The localized counterfactual approach may have value in other agentic RL settings with sparse or delayed rewards.

minor comments (1)

The abstract states that RICE-PO 'consistently outperforms' baselines but supplies no quantitative metrics, statistical tests, ablation results, or implementation details, making it impossible to assess whether the central empirical claim is supported.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their time and for providing a concise summary of RICE-PO along with noting its potential significance for localized supervision in agentic RL. No specific major comments appear in the report, so we have nothing to address point-by-point. We are happy to supply additional details, ablations, or clarifications if the editor or referee requests them.

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description present RICE-PO as a framework that derives credit signals from external retrieval metrics applied to local counterfactual branches around high-uncertainty executable actions, with propagation conditioned on measurable reasoning-to-action influence and residual stability. No equations, parameter-fitting steps, or self-citations are shown that would reduce the central claims or predictions to inputs by construction. The outperformance on BRIGHT and BEIR is framed as an empirical consequence of using these interaction-derived signals rather than outcome-level rewards, with no indication that the method's results are tautological or forced by internal definitions. The derivation chain appears self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of the method.

axioms (1)

Domain assumption: Retrieval metrics provide valid local evaluations of counterfactual action branches.
Invoked when the method evaluates the influence of reasoning steps via retrieval scores.

