pith. machine review for the scientific record.

arxiv: 2604.22722 · v1 · submitted 2026-04-24 · 💻 cs.IR · cs.AI · cs.LG

Recognition: unknown

Aligning Dense Retrievers with LLM Utility via Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:10 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG
keywords dense retrieval · utility alignment · LLM distillation · RAG · bi-encoder · perplexity · InfoNCE · QASPER

The pith

Training a bi-encoder to match an LLM's perplexity-reduction utility distribution produces dense retrieval that is more accurate than a strong semantic baseline and over 180 times faster than LLM re-ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dense retrieval relies on vector similarity that often misses documents most useful for generation, while LLM re-ranking captures true utility but at high test-time cost. This work reframes retrieval as matching a utility distribution derived from how much each document reduces an LLM's perplexity on the query. A standard bi-encoder is then trained with a modulated contrastive objective to embed these graded signals directly. The result is a fixed embedding space that ranks documents by generative usefulness without any LLM calls during search. On the QASPER benchmark this yields large gains in recall and MAP while delivering over 180x speedup relative to efficient re-ranking baselines.
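
To make the teacher signal concrete, here is a minimal sketch of deriving a per-document utility distribution from perplexity reduction, in the spirit of the description above. It is not the paper's recipe: the scorer model (gpt2 as a stand-in), the prompt format, the use of a reference answer as the scored continuation, and the softmax temperature are all assumptions.

```python
# Hypothetical sketch: per-document utility from LLM perplexity reduction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in scorer; the paper's LLM is likely much larger
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def answer_nll(prompt: str, answer: str) -> float:
    """Mean negative log-likelihood of the answer tokens given the prompt
    (token-boundary effects at the prompt/answer seam are glossed over)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + answer, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :prompt_len] = -100              # score only the answer span
    return lm(full, labels=labels).loss.item()  # = log-perplexity of the answer

def utility_distribution(question: str, answer: str, docs: list[str], tau: float = 1.0):
    """Utility of each doc = drop in answer log-perplexity when the doc is
    prepended; a softmax turns the gains into a teacher distribution."""
    base = answer_nll(f"Question: {question}\nAnswer: ", answer)
    gains = [
        base - answer_nll(f"Context: {d}\nQuestion: {question}\nAnswer: ", answer)
        for d in docs
    ]
    return torch.softmax(torch.tensor(gains) / tau, dim=0)
```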

Core claim

Retrieval utility can be expressed as the distribution of perplexity reductions that candidate passages induce in an LLM; a bi-encoder trained to imitate this distribution via a Utility-Modulated InfoNCE loss embeds graded usefulness signals into its representation space, allowing standard dense retrieval to recover the precision of LLM-based re-ranking without requiring LLM inference at query time.

What carries the argument

A Utility-Modulated InfoNCE objective that modulates the standard contrastive loss with per-document utility weights obtained from LLM perplexity reduction.
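
The objective itself is not spelled out on this page, so the following is only one plausible reading of "modulating standard contrastive loss with per-document utility weights": a soft-target cross-entropy in which the student's softmax over similarities is pulled toward the teacher's utility distribution. The exact weighting scheme and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def utility_modulated_infonce(q_emb: torch.Tensor,
                              d_embs: torch.Tensor,
                              utility: torch.Tensor,
                              tau: float = 0.05) -> torch.Tensor:
    """One plausible form of a utility-modulated InfoNCE loss.

    q_emb   : (dim,)    query embedding from the bi-encoder
    d_embs  : (n, dim)  embeddings of the n candidate documents for this query
    utility : (n,)      teacher utility distribution over the candidates (sums to 1)
    tau     : student softmax temperature (a free parameter of this sketch)
    """
    sims = F.cosine_similarity(q_emb.unsqueeze(0), d_embs, dim=-1) / tau
    log_student = F.log_softmax(sims, dim=0)
    # cross-entropy with soft targets: match the graded teacher distribution
    return -(utility * log_student).sum()
```

With a one-hot utility vector this reduces to ordinary InfoNCE over the candidate set, which is why it can be read as a modulation of the standard contrastive loss; in training it would be averaged over the queries in a batch, with the utility vector coming from a perplexity-reduction procedure like the sketch above.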

If this is right

  • Recall@1 rises by 30.59 percent and MAP by 30.16 percent over a strong semantic baseline on QASPER.
  • Token F1 improves by 17.3 percent while preserving competitive end-to-end performance.
  • Inference runs more than 180 times faster than efficient LLM re-ranking methods.
  • Graded utility signals reside in the embedding space and require no test-time LLM access (the remaining query-time work is sketched after this list).
  • The approach directly supports large-scale RAG deployments that need both precision and throughput.
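
The speed claim hinges on what query-time work remains once utility is baked into the embedding space. A minimal sketch of that query path, with a brute-force similarity scan standing in for whatever ANN index a deployment would actually use:

```python
import numpy as np

def dense_search(query_emb: np.ndarray, doc_index: np.ndarray, k: int = 10):
    """Query-time retrieval once utility lives in the embeddings: one encoder
    forward pass for the query (not shown), then a similarity scan over a
    precomputed document index -- no LLM calls at search time.

    query_emb : (dim,)    L2-normalized query embedding
    doc_index : (n, dim)  L2-normalized document embeddings, built offline
    """
    scores = doc_index @ query_emb       # cosine similarity via dot product
    top = np.argsort(-scores)[:k]        # in practice an ANN index (e.g. FAISS)
    return top, scores[top]
```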

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same distillation procedure could be applied to other retrieval benchmarks to check whether the utility alignment generalizes beyond QASPER.
  • If the learned embeddings remain effective across different LLMs, they could serve as a reusable utility prior for multiple generation models.
  • RAG pipelines that swap standard dense retrievers for Utility-Aligned Embeddings (UAE) would see lower latency and potentially higher answer quality at the same index size.
  • One could test whether the method still works when the utility signal is derived from a different proxy than perplexity reduction.

Load-bearing premise

A bi-encoder can accurately imitate the graded utility distribution coming from LLM perplexity reduction without substantial information loss or distillation bias.

What would settle it

On the QASPER test set the UAE model shows no gain in Recall@1 or Token F1 relative to the BGE-Base baseline, or its latency advantage over LLM re-ranking vanishes while accuracy stays comparable.
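
For reference when judging such an outcome, here are minimal implementations of the two retrieval metrics involved, Recall@1 and MAP, over per-query ranked lists. The rankings themselves would come from the UAE and BGE-Base encoders; only the metric definitions are shown here.

```python
def recall_at_1(ranked, relevant):
    """Fraction of queries whose top-ranked document is relevant.
    ranked: list of ranked doc-id lists, one per query; relevant: list of sets."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if r and r[0] in rel)
    return hits / len(ranked)

def mean_average_precision(ranked, relevant):
    """Mean over queries of the average precision of each ranked list."""
    ap_total = 0.0
    for r, rel in zip(ranked, relevant):
        found, precisions = 0, []
        for rank, doc_id in enumerate(r, start=1):
            if doc_id in rel:
                found += 1
                precisions.append(found / rank)
        ap_total += sum(precisions) / max(len(rel), 1)
    return ap_total / len(ranked)
```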

Figures

Figures reproduced from arXiv: 2604.22722 by Cheng Chang, Di Mu, Ga Wu, Himanshu Rai, Maksims Volkovs, Md Shahriar Tasjid, Rajinder Sandhu.

Figure 1: Efficiency vs. Performance. UAE (red star) occu…
Figure 2: Overview of Utility-Aligned Embeddings (UAE). Utility is distilled offline into a reward model (Stage A), which defines…
Figure 4: Alignment of various retrieval models with the…
Figure 5: Zero-shot transfer performance. The model was…
Original abstract

Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than the efficient LLM re-ranking methods preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Utility-Aligned Embeddings (UAE), a distillation framework that trains a bi-encoder dense retriever to match a utility distribution derived from LLM perplexity reduction. It introduces a Utility-Modulated InfoNCE objective to inject graded utility signals into the embedding space without test-time LLM inference. On the QASPER benchmark, UAE reports gains of 30.59% in Recall@1, 30.16% in MAP, and 17.3% in Token F1 over the BGE-Base semantic baseline, while claiming an over-180x speedup relative to efficient LLM re-ranking methods.

Significance. If the central claim holds, the work has clear significance for practical RAG pipelines: it offers a path to combine the precision of utility-based retrieval with the inference speed of bi-encoders. The empirical speed-accuracy tradeoff and the idea of distilling LLM-derived graded signals into static embeddings are potentially impactful contributions to the dense retrieval literature.

major comments (2)
  1. [Methods (Utility-Modulated InfoNCE formulation)] The claim that the bi-encoder faithfully imitates the LLM-derived utility distribution rests on the modulation step, yet no analysis shows that student similarities preserve the teacher's graded ranking (e.g., via rank correlation or utility-weighted retrieval metrics). Standard InfoNCE already exploits hard negatives; without this diagnostic it is unclear whether the reported QASPER gains arise from true utility alignment or from generic contrastive training on the same data.
  2. [§4 (QASPER experiments)] The 30%+ lifts over BGE-Base are load-bearing for the central claim, but the manuscript lacks ablations that isolate the modulation parameter and the utility distribution source. A direct comparison to an unmodulated InfoNCE baseline trained on identical data would be required to attribute the gains to LLM utility rather than to data selection or the training schedule.
minor comments (2)
  1. [Abstract / §3] The modulation parameter is referenced but never defined or given a default value; a short equation or pseudocode snippet would improve clarity.
  2. [Results tables] Table 1 (or the equivalent results table) reports only point estimates without standard deviations across seeds, making it hard to judge whether the 30% relative gains are statistically reliable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and have revised the manuscript to incorporate additional analyses and ablations as requested.

Point-by-point responses
  1. Referee: [Methods (Utility-Modulated InfoNCE formulation)] The claim that the bi-encoder faithfully imitates the LLM-derived utility distribution rests on the modulation step, yet no analysis shows that student similarities preserve the teacher's graded ranking (e.g., via rank correlation or utility-weighted retrieval metrics). Standard InfoNCE already exploits hard negatives; without this diagnostic it is unclear whether the reported QASPER gains arise from true utility alignment or from generic contrastive training on the same data.

    Authors: We agree that explicit diagnostics are needed to confirm preservation of graded utility signals. In the revised manuscript we have added a new analysis subsection that reports Spearman's rank correlation between bi-encoder similarities and LLM-derived utility scores on a held-out set, along with utility-weighted retrieval metrics (a minimal sketch of such a rank-correlation check follows these responses). These diagnostics show higher correlation under the modulated objective than under standard InfoNCE, supporting that the QASPER gains arise from utility alignment rather than generic contrastive training. revision: yes

  2. Referee: [§4 (QASPER experiments)] The 30%+ lifts over BGE-Base are load-bearing for the central claim, but the manuscript lacks ablations that isolate the modulation parameter and the utility distribution source. A direct comparison to an unmodulated InfoNCE baseline trained on identical data would be required to attribute the gains to LLM utility rather than to data selection or the training schedule.

    Authors: We acknowledge that isolating the modulation effect requires a controlled ablation. The revised manuscript now includes a direct comparison in §4 between UAE and an unmodulated InfoNCE baseline trained on the identical data, negatives, and schedule. The modulated variant outperforms the unmodulated baseline, confirming the contribution of the LLM-derived utility distribution beyond data selection or training details. We have also clarified the utility distribution construction in the methods. revision: yes
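
The diagnostic that the referee and the simulated rebuttal converge on is easy to state in code. A minimal sketch, assuming per-query arrays of bi-encoder similarities and LLM-derived utility scores over the same candidates are already available (their construction is not described on this page):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlation(student_sims, teacher_utils):
    """Mean per-query Spearman correlation between the bi-encoder's similarity
    scores and the LLM-derived utility scores for the same candidate documents.

    student_sims and teacher_utils are parallel lists: one array of scores per
    query, both covering the same candidate set in the same order."""
    rhos = []
    for sims, utils in zip(student_sims, teacher_utils):
        rho, _ = spearmanr(sims, utils)
        if not np.isnan(rho):        # constant score lists yield nan; skip them
            rhos.append(rho)
    return float(np.mean(rhos))
```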

Circularity Check

0 steps flagged

No circularity in derivation: standard distillation setup remains independent of reported metrics

Full rationale

The paper's core derivation formulates retrieval as distribution matching and trains a bi-encoder via Utility-Modulated InfoNCE to imitate an LLM-derived utility distribution from perplexity reduction. This is a conventional teacher-student distillation procedure with no equations that define the output in terms of itself, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. The QASPER gains (Recall@1 +30.59%, etc.) are presented as empirical outcomes rather than algebraic consequences of the training objective. The method is validated against external benchmarks rather than self-defined quantities, and no claimed result reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that utility can be distilled from LLM perplexity into embeddings via contrastive training. No new entities are invented.

free parameters (1)
  • modulation parameter in Utility-Modulated InfoNCE
    The objective likely has a parameter to modulate the utility signal, which may be tuned.
axioms (1)
  • domain assumption: Perplexity reduction is a valid proxy for document utility in LLM generation.
    The method derives utility from perplexity reduction.

pith-pipeline@v0.9.0 · 5517 in / 1401 out tokens · 42424 ms · 2026-05-08T10:10:12.463255+00:00 · methodology

