pith. sign in

arxiv: 2509.23370 · v2 · submitted 2025-09-27 · 💻 cs.CV

GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval

Pith reviewed 2026-05-18 12:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords query rewritingretrieval augmentationpolicy optimizationCLIPmultimodal retrievalranking rewarddistribution shift
0
0 comments X

The pith

Ranking signals from a frozen retriever train an LLM to rewrite queries that better match the retriever's scoring behavior under distributional shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that an LLM can be trained to rewrite queries by receiving rewards based on how well those rewrites perform when ranked by the fixed retriever. This addresses performance drops on queries that differ in language, length, or modality without the cost of retraining the retriever or re-embedding the corpus. The authors identify that training directly on similarity scores causes irrelevant items to receive inflated scores, and they replace that with relative ranking comparisons within groups to give more reliable supervision. If the approach holds, retrieval systems gain flexibility to handle new query styles through a lightweight rewrite step while leaving the core model untouched.

Core claim

GRAPE applies Grouped Relative Policy Optimization (GRPO) to fine-tune an LLM for query rewriting, where the reward measures the relative ranking success of the rewritten query compared with the original as judged by the frozen retriever. This alignment with the retriever's latent distribution counters the score inflation that occurs in naive similarity-based fine-tuning and produces consistent gains across shifted query types.

What carries the argument

Grouped Relative Policy Optimization (GRPO) driven by a corpus-relative ranking-based reward that compares rewritten and original queries in batches.

Load-bearing premise

The relative ranking comparisons supplied by the retriever give an accurate and unbiased signal for what counts as a good query rewrite.

What would settle it

Experiments on the reported benchmarks or similar ones showing no improvement or a drop in Recall@10 when GRAPE is used versus simple rewriting baselines would show the ranking supervision fails to deliver the claimed alignment.

Figures

Figures reproduced from arXiv: 2509.23370 by Chenchen Zhao, Jianhuan Zhuo, Jihong Zhang, Mingyang Chen, Muxi Chen, Qiuyong Xiao, Tianwen Jiang, Wenyu Jiang, Yutang, Zhaohua Zhang, Zhixun Su.

Figure 1
Figure 1. Figure 1: Overview of GRAPE. The framework operates in two phases: (i) Query Rewriting and Retrieval, where an LLM generates K rewrites for a query, each encoded by a frozen CLIP retriever to produce target ranks; and (ii) Relative Reward Computation and Optimization, where a format reward and a rank reward are combined into a unified feedback signal, which is normalized within group to compute relative advantages a… view at source ↗
Figure 2
Figure 2. Figure 2: Case study of query rewriting, where Figures (a) and (b) illustrate multilingual query [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Case study of multimodal query rewriting. Since CLIP does not natively support multi [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: R@1 and R@10 performance across different training data scales for Flickr30k-CN and [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Using similarity scores as the reward leads to score inflation. As training epochs increase, [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of similarity-based and ranking-based rewards during training. Ranking [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Radar plots of Recall@1 (top) and Recall@10 (bottom) for GRAPE on XM3600 across [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
read the original abstract

The CLIP model has established itself as a cornerstone of large-scale retrieval systems. However, its performance often degrades under distributional shifts such as multilingual, long-form, or multimodal queries. To avoid the prohibitive costs associated with retriever retraining or corpus re-embedding, we propose GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play approach that leverages LLM-based query rewriting to bridge these gaps. Unlike existing methods that lack explicit supervision, GRAPE integrates ranking signals into the rewriting LLM via Grouped Relative Policy Optimization (GRPO), ensuring rewritten queries are better aligned with the frozen retriever's latent distribution. Crucially, we identify a score inflation phenomenon in naive similarity-based finetuning - where irrelevant candidates receive indiscriminately high scores - and mitigate it with a novel corpus-relative ranking-based reward. Extensive experiments across multilingual (Flickr30k-CN, CVLUE, XM3600), long-form (Wikipedia), and multimodal (CIRR) benchmarks demonstrate that GRAPE consistently improves performance, achieving an average gain of 4.9% in Recall@10 without any modification to the underlying retriever. The code is available at https://github.com/mogulzhang/GRAPE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes GRAPE, a plug-and-play method for improving CLIP-based retrieval under distributional shifts (multilingual, long-form, multimodal) by using an LLM to rewrite queries. It integrates ranking signals from a frozen retriever into the LLM via Grouped Relative Policy Optimization (GRPO) and introduces a corpus-relative ranking-based reward to address score inflation observed in naive similarity finetuning. Experiments on benchmarks such as Flickr30k-CN, CVLUE, XM3600, Wikipedia, and CIRR report an average 4.9% gain in Recall@10 without modifying the retriever.

Significance. If the central claims hold, the work provides a practical way to adapt retrieval systems to query variations without retraining or re-embedding the corpus, which is valuable for real-world multimodal and multilingual applications. The use of GRPO with a ranking-derived reward and the public code release are positive aspects that support reproducibility and potential adoption.

major comments (3)
  1. [Abstract and §4] Abstract and §4: The central claim that the corpus-relative ranking-based reward specifically mitigates score inflation (indiscriminate high scores on irrelevants) is asserted but not isolated. Downstream Recall@10 gains alone do not demonstrate this; an ablation comparing the proposed reward against naive similarity finetuning and alternatives (e.g., margin-based or contrastive rewards) is needed to verify the mitigation effect and rule out new biases.
  2. [§3.2] §3.2 (GRPO and reward definition): The reward is described as corpus-relative and ranking-based, but it is unclear whether it requires full corpus access during training or only sampled groups. This detail is load-bearing for the practicality claim that the method remains efficient at inference without full corpus access.
  3. [§4] §4 (experiments): No details are provided on statistical significance testing, exact data splits, or hyperparameter tuning for GRPO and the reward. This undermines confidence in the reported 4.9% average Recall@10 gain and cross-benchmark consistency.
minor comments (2)
  1. [Abstract] The abstract mentions 'extensive experiments' but would benefit from a brief summary table of per-benchmark gains and baselines in the main text for quick assessment.
  2. [§3] Notation for the reward function (e.g., how relative ranking is computed within groups) should be formalized with an equation to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the evidence for our claims.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4: The central claim that the corpus-relative ranking-based reward specifically mitigates score inflation (indiscriminate high scores on irrelevants) is asserted but not isolated. Downstream Recall@10 gains alone do not demonstrate this; an ablation comparing the proposed reward against naive similarity finetuning and alternatives (e.g., margin-based or contrastive rewards) is needed to verify the mitigation effect and rule out new biases.

    Authors: We acknowledge that the manuscript identifies the score inflation phenomenon in naive similarity-based finetuning and positions the corpus-relative ranking reward as a mitigation strategy, but the current results focus on end-to-end Recall@10 improvements rather than isolating the reward's specific contribution. To directly address this, we will add a dedicated ablation study in the revised version comparing GRAPE's reward against naive similarity finetuning as well as margin-based and contrastive alternatives. This will provide clearer evidence for the mitigation effect and help rule out alternative explanations or new biases. revision: yes

  2. Referee: [§3.2] §3.2 (GRPO and reward definition): The reward is described as corpus-relative and ranking-based, but it is unclear whether it requires full corpus access during training or only sampled groups. This detail is load-bearing for the practicality claim that the method remains efficient at inference without full corpus access.

    Authors: We thank the referee for highlighting this ambiguity. The reward is computed using sampled groups of candidates drawn from the corpus during training; relative rankings are determined only within each sampled group rather than requiring access to the entire corpus. This sampling-based design keeps training efficient and preserves the plug-and-play property at inference, where the method requires no corpus access beyond the frozen retriever. We will revise §3.2 to state this explicitly. revision: yes

  3. Referee: [§4] §4 (experiments): No details are provided on statistical significance testing, exact data splits, or hyperparameter tuning for GRPO and the reward. This undermines confidence in the reported 4.9% average Recall@10 gain and cross-benchmark consistency.

    Authors: We agree that additional experimental details are necessary to support reproducibility and confidence in the reported gains. In the revised manuscript we will include: (i) results of statistical significance testing with standard deviations across multiple runs, (ii) the precise data splits employed for each benchmark, and (iii) the hyperparameter search and final values used for GRPO and the reward function. These additions will better substantiate the average 4.9% Recall@10 improvement and its consistency across benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines GRAPE via GRPO applied to an LLM query rewriter, with the reward explicitly constructed from corpus-relative rankings produced by the separate frozen retriever. This reward signal is external to the policy being optimized and is not obtained by fitting parameters to the target Recall@10 metric or by re-expressing the alignment claim in terms of itself. Performance improvements are reported as empirical measurements on distinct multilingual, long-form, and multimodal test sets rather than as quantities forced by the reward definition. No self-definitional equations, fitted-input predictions, load-bearing self-citations, or imported uniqueness theorems appear in the abstract or described method; the central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on standard assumptions about reinforcement learning from ranking feedback and the existence of a fixed retriever distribution; no new entities or heavily fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption Ranking signals from the frozen retriever provide useful supervision for query rewriting.
    Invoked when GRPO is applied to align rewritten queries with the retriever.

pith-pipeline@v0.9.0 · 5786 in / 1165 out tokens · 35894 ms · 2026-05-18T12:14:26.044069+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    naively finetuning LLM with similarity scores can lead to score inflation... we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Alt- clip: Altering the language encoder in clip for extended language capabilities.arXiv preprint arXiv:2211.06679,

    Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Alt- clip: Altering the language encoder in clip for extended language capabilities.arXiv preprint arXiv:2211.06679,

  2. [2]

    Meta clip 2: A worldwide scaling recipe

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Metaclip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062,

  3. [3]

    Llm2clip: Powerful language model unlock richer visual representation.arXiv preprint arXiv:2411.04997,

    Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, et al. Llm2clip: Powerful language model unlock richer visual representation.arXiv preprint arXiv:2411.04997,

  4. [4]

    Leveraging audio and visual recurrence for unsu- pervised video highlight detection

    Zahidul Islam, Sujoy Paul, and Mrigank Rochan. Leveraging audio and visual recurrence for unsu- pervised video highlight detection. InNeurIPS 2024 Workshop: Self-Supervised Learning-Theory and Practice. Kenan Jiang, Xuehai He, Ruize Xu, and Xin Eric Wang. Comclip: Training-free compositional image and text matching.arXiv preprint arXiv:2211.13854,

  5. [5]

    Interactive text- to-image retrieval with large language models: A plug-and-play approach.arXiv preprint arXiv:2406.03411,

    Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, and Sungroh Yoon. Interactive text- to-image retrieval with large language models: A plug-and-play approach.arXiv preprint arXiv:2406.03411,

  6. [6]

    Zero-shot clustering of embeddings with self-supervised learnt encoders

    Scott C Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B Moeslund, and Graham W Taylor. Zero-shot clustering of embeddings with self-supervised learnt encoders. In4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023),

  7. [7]

    Learning clustering-based prototypes for compositional zero-shot learning.arXiv preprint arXiv:2502.06501, 2025a

    Hongyu Qu, Jianan Wei, Xiangbo Shu, and Wenguan Wang. Learning clustering-based prototypes for compositional zero-shot learning.arXiv preprint arXiv:2502.06501, 2025a. Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, and Gang Xiong. Proapo: Progressively automatic prompt optimization for visual classification. InPro- ceedin...

  8. [8]

    Finetuning clip to reason about pairwise differences.arXiv preprint arXiv:2409.09721,

    Dylan Sam, Devin Willmott, Joao D Semedo, and J Zico Kolter. Finetuning clip to reason about pairwise differences.arXiv preprint arXiv:2409.09721,

  9. [9]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models.arXiv preprint arXiv:2402.03300,

  10. [10]

    Crossmodal-3600: A massively multilingual multimodal evaluation dataset.arXiv preprint arXiv:2205.12522,

    Ashish V Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset.arXiv preprint arXiv:2205.12522,

  11. [11]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

  12. [12]

    Scaling pre-training to one hundred billion data for vision language models.arXiv preprint arXiv:2502.07617, 2025a

    Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, and Xiaohua Zhai. Scaling pre-training to one hundred billion data for vision language models.arXiv preprint arXiv:2502.07617, 2025a. Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che, and Hongyang Chen. Cvlue: A new benchmark dataset for chinese vision-languag...

  13. [13]

    Lotlip: Improving language-image pre-training for long text understanding.arXiv preprint arXiv:2410.05249, 2024

    Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, and Zheng-Jun Zha. Lotlip: Improving language-image pre-training for long text understanding.arXiv preprint arXiv:2410.05249,

  14. [14]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

  15. [15]

    International Conference on Learning Representations (ICLR) , year =

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936,

  16. [16]

    Hierarchical cross- modal prompt learning for vision-language models.arXiv preprint arXiv:2507.14976,

    Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, and Zhenhua Huang. Hierarchical cross- modal prompt learning for vision-language models.arXiv preprint arXiv:2507.14976,