pith. machine review for the scientific record.

arxiv: 2604.23734 · v1 · submitted 2026-04-26 · 💻 cs.IR

Recognition: unknown

Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:11 UTC · model grok-4.3

classification 💻 cs.IR
keywords reranker · agentic retrieval · contribution statement · evidence passage · LLM-as-Judge · BEIR benchmark · RAG · information retrieval

The pith

Reranker models can jointly output relevance judgments, contribution statements, and evidence passages to aid agentic retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prism-Reranker is a family of models built on Qwen3.5 that, for documents judged relevant, also emit a contribution statement summarizing how the document helps the query and an evidence passage that preserves every relevant signal while discarding noise. This design targets the needs of retrieval-augmented generation and autonomous agents, which currently receive only scalar scores and must ingest full documents. Training uses a hybrid objective: point-wise distillation from a commercial reranker API combined with supervised fine-tuning on the new contribution and evidence targets, with data relabeled by an LLM-as-Judge ensemble to obtain consistent binary supervision. The models deliver solid results across all four sizes (0.8B, 2B, 4B, 9B) on a QA subset of BEIR, and applying the same recipe to Qwen3-Reranker-4B raises its NDCG@10 by +1.54.
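To make the three-part output concrete, here is a minimal parsing sketch. The tag names follow the output format the paper's appendix describes (<contribution>: what the document contributes to the query; <evidence>: a self-contained rewrite of relevant content; an intentionally empty <think></think> block that disables the backbone's chain-of-thought channel so the very next decoded token is the verdict). The exact serialization around the tags is an assumption here, not the paper's verbatim template.

```python
import re

# Hypothetical raw completion from a Prism-Reranker forward pass.
# Tag names follow the appendix excerpt; the surrounding layout is an
# assumption for illustration only.
raw = (
    "<think></think>yes\n"
    "<contribution>Explains the failure mode the query asks about.</contribution>\n"
    "<evidence>The outage was caused by a misconfigured cache TTL...</evidence>"
)

def parse_prism_output(text: str):
    """Split a completion into (verdict, contribution, evidence)."""
    # The empty <think></think> block means the verdict token comes next.
    verdict = "yes" if re.search(r"</think>\s*yes", text) else "no"
    def tag(name):
        m = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        return m.group(1).strip() if m else None
    # Contribution and evidence are only emitted when the verdict is yes.
    return verdict, tag("contribution"), tag("evidence")

verdict, contribution, evidence = parse_prism_output(raw)
print(verdict, contribution, evidence, sep="\n")
```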

Core claim

Prism-Reranker shows that a single reranker can emit a yes/no relevance verdict together with a contribution statement and a self-contained evidence passage whenever the verdict is yes. This is achieved through hybrid distillation plus supervised fine-tuning on curated, LLM-relabeled data, and it preserves competitive ranking quality on BEIR QA tasks while allowing existing LLM rerankers to be directly augmented with the same capabilities.

What carries the argument

Joint emission of relevance verdict, contribution statement, and noise-free evidence passage, enabled by a hybrid training objective that mixes commercial reranker distillation with supervised fine-tuning on the additional targets.
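A minimal sketch, assuming a PyTorch-style setup, of what such a hybrid objective could look like: a pointwise distillation term pulling the verdict probability toward the commercial teacher's score, plus a cross-entropy SFT term on the contribution and evidence tokens. The binary-cross-entropy form of the distillation term and the alpha weighting are assumptions; the paper specifies only that the two components are combined.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(yes_logit, no_logit, teacher_score, gen_logits, gen_targets,
                alpha=0.5):
    """Sketch of a hybrid reranker objective (weighting is an assumption).

    yes_logit, no_logit: 0-dim tensors for the verdict tokens.
    teacher_score: 0-dim tensor in [0, 1] from the commercial reranker API.
    gen_logits: (seq_len, vocab) logits over contribution+evidence tokens.
    gen_targets: (seq_len,) token ids of the supervised targets.
    """
    # Pointwise distillation: match p(yes) to the teacher's score.
    p_yes = torch.sigmoid(yes_logit - no_logit)
    distill = F.binary_cross_entropy(p_yes, teacher_score)
    # SFT: next-token cross-entropy on contribution/evidence targets.
    sft = F.cross_entropy(gen_logits, gen_targets)
    return alpha * distill + (1 - alpha) * sft
```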

If this is right

  • Agents can substitute full documents with the provided evidence passages and contribution statements, reducing context length and tangential content (see the sketch after this list).
  • The same training recipe transfers to other base LLM rerankers and yields measurable gains on standard NDCG@10 metrics.
  • Keyword-style query reformulations in training data make the model more robust to the short, agent-generated queries common in real traffic.
  • Performance holds across model scales from 0.8B to 9B parameters without requiring separate architectures for the extra outputs.
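As a sketch of the first point, assuming a hypothetical rerank(query, doc) wrapper that returns the three outputs, an agent's context assembly could replace document bodies with evidence passages:

```python
# Sketch: assembling agent context from reranker outputs instead of full
# documents. `rerank` is a hypothetical callable around a Prism-style model
# returning (verdict, contribution, evidence) per document.

def build_context(query, docs, rerank, budget_chars=4000):
    parts = []
    for doc in docs:
        verdict, contribution, evidence = rerank(query, doc)
        if verdict != "yes":
            continue  # irrelevant documents never enter the context
        # Contribution tells the agent *why* the passage is here;
        # evidence replaces the full document body.
        parts.append(f"[why] {contribution}\n[evidence] {evidence}")
    context = "\n\n".join(parts)
    saved = sum(len(d) for d in docs) - len(context)
    return context[:budget_chars], saved  # chars as a crude token proxy
```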

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RAG pipelines could cut token usage by feeding agents only the generated evidence passages instead of complete retrieved documents.
  • Downstream task accuracy might serve as an additional training signal to further refine contribution and evidence quality beyond LLM judges.
  • The joint-output format could extend to related retrieval settings such as multi-document synthesis or personalized result presentation.

Load-bearing premise

LLM-as-Judge ensembles produce reliable binary relevance labels and accurate, systematically unbiased quality judgments for contribution and evidence outputs.

What would settle it

A human evaluation testing whether raters find that the generated contribution statements or evidence passages frequently omit critical query-relevant information or introduce inaccuracies not present in the source documents.

Figures

Figures reproduced from arXiv: 2604.23734 by Dun Zhang.

Figure 1: Three outputs from one forward pass. The same LM head is reused for (i) …
Figure 2: Data construction pipeline. Open-source IR corpora and live web pages retrieved …
Figure 3: Inter-annotator agreement among LLM judges, measured as Cohen's …
Figure 4: Joint distribution of training pairs over teacher score …
Figure 5: Compression statistics of the evidence field on the held-out dev set …
Original abstract

Modern retrieval pipelines increasingly serve downstream consumers like retrieval-augmented generation (RAG) and autonomous agents that need more than a scalar relevance score. A reranker that only tells the caller "how relevant" forces the agent to dump entire documents into the language-model context, wasting tokens on tangential passages and boilerplate. We introduce Prism-Reranker, a family of reranker models built on Qwen3.5 at four sizes (0.8B, 2B, 4B, 9B) that goes beyond scalar scoring. In addition to the standard yes/no relevance judgement, whenever the verdict is yes the model emits (i) a contribution statement summarizing how the document helps the query, and (ii) an evidence passage: a self-contained rewrite that preserves every query-relevant signal while discarding noise. Prism-Reranker is trained with a hybrid objective combining point-wise distillation from a strong commercial reranker API with supervised fine-tuning on contribution and evidence targets. We curate training data from KaLM-Embedding's open-source aggregation, augmented with real web documents retrieved via commercial search APIs for open-domain queries and LLM-synthesized variants, and rewrite a portion of queries into keyword-style reformulations to adapt the model to agent-issued traffic. To reconcile inconsistent labels across open corpora and obtain crisp binary supervision, we relabel data with an LLM-as-Judge ensemble aggregating votes from five frontier LLMs. On a QA subset of BEIR and on an LLM-judged evaluation of contribution and evidence quality, Prism-Reranker attains solid results across all four sizes. We further show that the same recipe extends existing LLM-based rerankers, augmenting Qwen3-Reranker-4B with contribution and evidence capabilities while improving its average BEIR-QA NDCG@10 by +1.54 over the base model. Model weights, training recipe, and evaluation suite are released.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces Prism-Reranker, a family of Qwen3.5-based reranker models (0.8B, 2B, 4B, 9B) that output binary relevance judgments plus, for relevant documents, a contribution statement summarizing how the document helps the query and an evidence passage that preserves relevant signals while discarding noise. Training combines point-wise distillation from a commercial reranker API with supervised fine-tuning on contribution/evidence targets. Data curation draws from KaLM-Embedding, commercial web retrievals, and LLM-synthesized variants; binary labels are obtained via an LLM-as-Judge ensemble of five frontier models, and a portion of queries is rewritten into keyword style. Experiments report competitive NDCG@10 on a BEIR QA subset, LLM-judged quality scores for the new outputs, and a +1.54 average NDCG@10 lift when the same recipe is applied to Qwen3-Reranker-4B. Model weights, training recipe, and evaluation suite are released.

Significance. If the reported gains hold, the work is significant for information retrieval because it directly addresses the token-waste problem in agentic RAG pipelines by supplying structured, query-focused outputs instead of scalar scores. The release of weights, recipe, and evaluation artifacts is a clear strength that supports reproducibility and follow-on work. The hybrid objective and data-curation pipeline are clearly motivated and the central empirical claims appear internally consistent with the described experiments on public BEIR data.

major comments (1)
  1. [§4 (Experiments and Evaluation)] The contribution and evidence quality metrics are obtained exclusively via the same class of LLM judges used for training-data relabeling. Although the BEIR NDCG@10 numbers are independent, the quality scores could be inflated by alignment with the judges' own preferences. An inter-annotator agreement analysis among the five judges or a small-scale human validation study would be needed to confirm that the quality claims are not circular.
minor comments (3)
  1. [Abstract] The statement that the model 'attains solid results across all four sizes' is vague; reporting the actual NDCG@10 ranges or deltas for the BEIR QA subset would give readers a concrete summary of the main result (the metric itself is sketched after this list).
  2. [Data curation] The keyword-style query reformulation step is mentioned but no count or ablation is provided; adding the fraction of queries rewritten and its measured effect on NDCG would clarify its contribution.
  3. [Method] The distinction between 'contribution statement' and 'evidence passage' is clear in the abstract but should be reinforced with an explicit example in the method section to avoid reader confusion about the output format.
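For readers weighing minor comment 1, the metric in question is standard. A minimal sketch of a linear-gain NDCG@10, as commonly used in BEIR-style evaluations; the example labels are hypothetical:

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k for one query; ranked_rels are graded relevance labels
    listed in the model's ranking order (linear-gain variant)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# A "+1.54" lift, as quoted in the abstract, presumably refers to mean
# NDCG@10 on the conventional 0-100 scale.
print(ndcg_at_k([1, 0, 1, 0, 0]))  # ~0.92 for this toy ranking
```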

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below, committing to revisions where appropriate while remaining honest about scope limitations.

Point-by-point responses
  1. Referee: [§4 (Experiments and Evaluation)] The contribution and evidence quality metrics are obtained exclusively via the same class of LLM judges used for training-data relabeling. Although the BEIR NDCG@10 numbers are independent, the quality scores could be inflated by alignment with the judges' own preferences. An inter-annotator agreement analysis among the five judges or a small-scale human validation study would be needed to confirm that the quality claims are not circular.

    Authors: We agree that using the same ensemble of five frontier LLMs for both training-data relabeling and for scoring contribution/evidence quality introduces a legitimate risk of circularity or preference alignment. The BEIR NDCG@10 results are unaffected because they rely on the benchmark's independent ground-truth relevance labels. To directly address the concern, we will add an inter-annotator agreement analysis (e.g., Fleiss' kappa and pairwise agreement rates) computed on a held-out sample of the evaluation set in the revised manuscript; this analysis has already been performed and shows substantial agreement among the judges. A small-scale human validation study would provide the strongest external check, but was outside the original experimental budget and timeline. We will therefore note the absence of human validation explicitly as a limitation and suggest it as future work. These changes will be reflected in §4 and the limitations section. revision: partial
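A minimal sketch of the agreement statistic the rebuttal commits to, assuming every (query, document) pair is voted on by the same five judges; the yes/no votes below are hypothetical. Pairwise agreement rates, also promised in the rebuttal, are not sketched here.

```python
from collections import Counter

def fleiss_kappa(label_matrix):
    """Fleiss' kappa for N items each rated by the same number of judges.
    label_matrix: list of per-item label lists, e.g. five LLM-judge
    yes/no votes per (query, document) pair."""
    n_raters = len(label_matrix[0])
    categories = sorted({l for row in label_matrix for l in row})
    # Per-item agreement P_i and overall category counts.
    P, counts = [], Counter()
    for row in label_matrix:
        c = Counter(row)
        counts.update(c)
        P.append((sum(v * v for v in c.values()) - n_raters)
                 / (n_raters * (n_raters - 1)))
    total = len(label_matrix) * n_raters
    p_bar = sum(P) / len(P)                                  # mean agreement
    p_e = sum((counts[j] / total) ** 2 for j in categories)  # chance agreement
    return (p_bar - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Toy example: five judges voting on three (query, document) pairs.
votes = [["yes"] * 5, ["yes"] * 4 + ["no"], ["no"] * 3 + ["yes"] * 2]
print(round(fleiss_kappa(votes), 3))
```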

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper describes an empirical training pipeline for Prism-Reranker using hybrid distillation from commercial APIs plus SFT on contribution/evidence targets, followed by evaluation on public BEIR QA subsets and LLM-judged quality metrics. No equations, uniqueness theorems, or derivation steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on benchmark numbers (including the +1.54 NDCG@10 lift) that are independently verifiable outside the model's own training signals, with data curation and relabeling described as standard preprocessing rather than tautological. This is a standard empirical ML paper whose results do not collapse into their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard supervised fine-tuning and distillation assumptions plus the domain-specific assumption that LLM judges can supply reliable supervision for the new output types.

axioms (1)
  • (ad hoc to paper) LLM-as-Judge ensemble produces reliable binary relevance labels and quality assessments for contribution and evidence
    Invoked for reconciling inconsistent open-corpus labels and for the LLM-judged evaluation of outputs.
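A minimal sketch of the aggregation this axiom covers, assuming a strict-majority policy over five hypothetical judge callables; the abstract says only that votes from five frontier LLMs are aggregated, so the exact policy is an assumption.

```python
# Sketch of the relabeling step: aggregate binary votes from several
# LLM judges into one crisp label. `judges` is a list of hypothetical
# callables returning "yes" or "no" for a (query, document) pair.

def ensemble_label(query: str, document: str, judges) -> str:
    votes = [judge(query, document) for judge in judges]
    yes = sum(v == "yes" for v in votes)
    return "yes" if yes * 2 > len(votes) else "no"  # strict majority

# With an odd panel of five judges, majority vote can never tie.
```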

pith-pipeline@v0.9.0 · 5649 in / 1273 out tokens · 75628 ms · 2026-05-08T05:11:31.665417+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 21 canonical work pages · 5 internal anchors

  1. [1]

    Self-RAG: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations (ICLR), 2024.

  2. [2]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216 , 2024

  3. [3]

    xRAG: Extreme context compression for retrieval-augmented generation with one token

    Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. xRAG: Extreme context compression for retrieval-augmented generation with one token. arXiv preprint arXiv:2405.13792 , 2024

  4. [4]

    Exa: Neural search API

    Exa. Exa: Neural search API. https://exa.ai/, 2024

  5. [5]

    Perspectives on large language models for relevance judgment

    Guglielmo Faggioli, Laura Dietz, Charles L.A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. Perspectives on large language models for relevance judgment. In Proceedings of the 2023 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR), 2023.

  6. [6]

    Improving efficient neural ranking models with cross-architecture knowledge distillation

    Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666, 2020.

  7. [7]

    KaLM-Embedding: Superior training data brings a stronger embedding model

    Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, and Min Zhang. KaLM-Embedding: Superior training data brings a stronger embedding model, 2025. URL https://arxiv.org/abs/2501.01028.

  8. [8]

    EXIT: Context-aware extractive compression for enhancing retrieval-augmented generation

    Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, and Jong C. Park. EXIT: Context-aware extractive compression for enhancing retrieval-augmented generation. arXiv preprint arXiv:2412.12559, 2024.

  9. [9]

    Don't “overthink” passage reranking: Is reasoning truly necessary?

    Nour Jedidi, Yung-Sung Chuang, James Glass, and Jimmy Lin. Don't “overthink” passage reranking: Is reasoning truly necessary? arXiv preprint arXiv:2505.16886, 2025.

  10. [10]

    LLMLingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2023

  11. [11]

    LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) , 2024

  12. [12]

    KaLM-embedding-finetuning-data

    KaLM-Embedding Team. KaLM-embedding-finetuning-data. https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data, 2025. HuggingFace dataset card.

  13. [13]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

  14. [14]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

  15. [15]

    Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse

    Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L. Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. arXiv preprint arXiv:2410.21333, 2024.

  16. [16]

    Rethinking reasoning in document ranking: Why chain-of-thought falls short

    Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng, and Xiaoyu Shen. Rethinking reasoning in document ranking: Why chain-of-thought falls short. arXiv preprint arXiv:2510.08985, 2025.

  17. [17]

    Fine-tuning LLaMA for multi-stage text retrieval

    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning LLaMA for multi-stage text retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) , 2024

  18. [18]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. URL https://arxiv.org/abs/2210.07316.

  19. [19]

    Passage Re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019

  20. [20]

    Document ranking with a pretrained sequence-to-sequence model

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020 , 2020

  21. [21]

    LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, 2024.

  22. [22]

    RankVicuna: Zero-shot listwise document reranking with open-source large language models

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankVicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088, 2023

  23. [23]

    RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankZephyr: Effective and robust zero-shot listwise reranking is a breeze! arXiv preprint arXiv:2312.02724 , 2023

  24. [24]

    Qwen3.5: Foundation models for the open community

    Qwen Team. Qwen3.5: Foundation models for the open community. https://qwen.ai/blog?id=qwen3.5, 2026. Qwen Team blog post.

  25. [25]

    FIRST: Faster improved listwise reranking with single token decoding

    Revanth Gangi Reddy, JaeHyeok Doo, Yifei Xu, Md Arafat Sultan, Deevya Swain, Avirup Sil, and Heng Ji. FIRST: Faster improved listwise reranking with single token decoding. arXiv preprint arXiv:2406.15657, 2024.

  26. [26]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  27. [27]

    static-similarity-mrl-multilingual-v1

    Sentence Transformers. static-similarity-mrl-multilingual-v1. https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1, 2024. HuggingFace model card.

  28. [28]

    Is ChatGPT good at search? Investigating large language models as re-ranking agents

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

  29. [29]

    Tavily: Search API for AI agents

    Tavily. Tavily: Search API for AI agents. https://tavily.com/, 2024

  30. [30]

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In NeurIPS Datasets and Benchmarks Track , 2021

  31. [31]

    Replacing judges with juries: Evaluating LLM generations with a panel of diverse models

    Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024.

  32. [32]

    Jina-Reranker-v3: Last but not late interaction for listwise document reranking

    Feng Wang, Yuqing Li, and Han Xiao. Jina-Reranker-v3: Last but not late interaction for listwise document reranking. arXiv preprint , 2025

  33. [33]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 , 2022

  34. [34]

    PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization

    Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. In International Conference on Learning Representations (ICLR) , 2024

  35. [35]

    Learning to filter context for retrieval-augmented generation

    Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023.

  36. [36]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-Pack: Packed resources for general Chinese embeddings. arXiv preprint arXiv:2309.07597, 2023.

  37. [37]

    RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation. In International Conference on Learning Representations (ICLR), 2024

  38. [38]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR) , 2023

  39. [39]

    CompAct: Compressing retrieved documents actively for question answering

    Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, and Jaewoo Kang. CompAct: Compressing retrieved documents actively for question answering. arXiv preprint arXiv:2407.09014, 2024.

  40. [40]

    PosIR: Position-aware heterogeneous information retrieval benchmark, 2026

    Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan, Yudong Zhou, and Yuqing Yang. PosIR: Position-aware heterogeneous information retrieval benchmark, 2026. URL https://arxiv.org/abs/2601.08363

  41. [41]

    Qwen3 embedding: Advancing text embedding and reranking through foundation models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint, 2025.

  42. [42]

    KaLM-Embedding-V2: Superior training techniques and data inspire a versatile embedding model

    Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, and Min Zhang. KaLM-Embedding-V2: Superior training techniques and data inspire a versatile embedding model, 2025. URL https://arxiv.org/abs/2506.20923.

  43. [43]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.

  44. [44]

    JudgeLM: Fine-tuned large language models are scalable judges

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631 , 2023

  45. [45]

    RankT5: Fine-tuning T5 for text ranking with ranking losses

    Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. RankT5: Fine-tuning T5 for text ranking with ranking losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) , 2023

  46. [46]

    A setwise approach for effective and highly efficient zero-shot ranking with large language models

    Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. A setwise approach for effective and highly efficient zero-shot ranking with large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024.
