How important is Recall for Measuring Retrieval Quality?

Oleg Vasilyev; Randy Sawaya; Shelly Schwartz

arxiv: 2512.20854 · v2 · submitted 2025-12-24 · 💻 cs.CL · cs.IR

Pith reviewed 2026-05-16 20:32 UTC · model grok-4.3

keywords retrieval quality · recall · LLM judgments · response quality · information retrieval · evaluation metrics · unknown relevance set

The pith

Recall can be replaced by a simple alternative metric for assessing retrieval quality in realistic settings where the total number of relevant documents is unknown.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In large or changing knowledge bases, the complete set of relevant documents for any query is rarely known, so standard recall cannot be measured. The authors examine how well other retrieval metrics align with the quality of answers produced by language models using those retrieved documents, as judged by another LLM. Across several datasets containing only a small number of relevant items, they find that certain metrics track response quality closely. They also present a straightforward new measure that achieves good results without any knowledge of the full relevant set size.

Core claim

When the total number of relevant documents is unknown, retrieval quality can still be gauged with metrics that do not require that number. The evidence is the strong correlation of these metrics with the LLM-assessed quality of responses generated from the retrieved set. A newly introduced simple measure performs particularly well by this criterion.

What carries the argument

The simple retrieval quality measure proposed in the paper, which evaluates performance on the retrieved documents without depending on the total number of relevant documents.

If this is right

Metrics that avoid needing the total relevant count can substitute for recall in many evaluation scenarios.
LLM-based response quality judgments correlate well enough with retrieval quality to serve as a proxy.
The new simple measure shows strong performance across the tested datasets with low relevant document counts.
Different established strategies for dealing with unknown recall have varying levels of success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This method could simplify ongoing monitoring of retrieval systems in environments where relevance sets are incomplete or change frequently.
It opens the possibility of optimizing retrieval directly for downstream response quality rather than traditional metrics.
The findings might apply to other domains where complete ground truth is unavailable, such as web search or enterprise knowledge bases.

Load-bearing premise

LLM-based judgments of response quality serve as a reliable and unbiased proxy for true retrieval quality when recall cannot be computed.

What would settle it

A study that measures the correlation of the new measure against human judgments of response quality, using a dataset where the total number of relevant documents is fully known, to see if it matches or exceeds recall's correlation.

Figures

Figures reproduced from arXiv: 2512.20854 by Oleg Vasilyev, Randy Sawaya, Shelly Schwartz.

**Figure 2.** Distribution of the response score (1 to 5) for embedding models shown on Y-axis. On ARXIV; using

**Figure 3.** Distribution of the response score (1 to 5) for embedding models shown on Y-axis. On MSMARCO; using

**Figure 4.** Distribution of the response score (1 to 5) for embedding models shown on Y-axis. On HotpotQA

**Figure 5.** Spearman correlation between the retrieval measures (

**Figure 6.** Spearman correlation between F and the response score, on ARXIV

**Figure 7.** Difference between the Spearman correlations:

**Figure 8.** Difference between the Spearman correlations:

**Figure 9.** Difference between the Spearman correlations:

**Figure 10.** Difference between the Spearman correlations:

**Figure 11.** Difference between the Spearman correlations:

**Figure 12.** Difference between the Spearman correlations:

**Figure 13.** Difference between the Spearman correlations:

**Figure 14.** Difference between the Spearman correlations:

**Figure 15.** Difference between the Spearman correlations:

**Figure 16.** Difference between the Spearman correla

**Figure 17.** Difference between the Spearman correla

**Figure 18.** Difference between the Spearman correlations:

**Figure 19.** Distribution of the response score (1 to 5)

**Figure 21.** Snippet for getting ranked texts for a ranked sample

**Figure 22.** Snippet of assertions that can be made for

**Figure 23.** Pearson correlation between the retrieval measures (

**Figure 24.** Kendall Tau-b correlation between the retrieval measures (

**Figure 25.** Kendall Tau-c correlation between the retrieval measures (

**Figure 26.** Spearman correlation between F and the response score, on MSMARCO

**Figure 27.** Spearman correlation between F and the response score, on Natural Questions.

B Response Score. In Section 3.1 we have shown the distribution of the response score for ARXIV, MSMARCO and HotpotQA-sentences. Here we show the distributions for Natural Questions (

**Figure 28.** Spearman correlation between F and the response score, on HotpotQA-sentences

**Figure 29.** Spearman correlation between F and the response score, on HotpotQA-paragraphs.

Snippet text extracted with this caption: `len(Sg['rank']) == Sg['K']`; `len(Sg['inK']) == Sg['K']`; `sum(Sg['inK']) <= Sg['Np']`; `Sg['K'] in Sr['K']`; `Sg['rank'] == Sr['rank'][:Sg['K']]`

**Figure 30.** Snippet of assertions that can be made for

**Figure 31.** Spearman correlation between T and the response score, on ARXIV

**Figure 32.** Spearman correlation between T and the response score, on MSMARCO

**Figure 33.** Spearman correlation between T and the response score, on Natural Questions

**Figure 34.** Spearman correlation between T and the response score, on HotpotQA-sentences

**Figure 35.** Spearman correlation between T and the response score, on HotpotQA-paragraphs

**Figure 36.** Difference between the Spearman correlations:

**Figure 37.** Difference between the Spearman correlations: nDCG-response minus F-response. On HotpotQA-paragraphs

**Figure 39.** Difference between the Spearman correla

**Figure 40.** Difference between the Spearman correlations: T-response minus nDCG-response. On HotpotQA-paragraphs

**Figure 41.** Difference between the Spearman correlations:

**Figure 42.** Difference between the Spearman correlations:

**Figure 43.** Difference between the Spearman correlations:

Original abstract

In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
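The evaluation idea in the abstract, correlating a retrieval metric with an LLM-assigned response score, can be illustrated with a small rank-correlation computation. The following is a minimal sketch in pure Python, assuming one metric value and one 1-to-5 response score per retrieval example; the function names and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: one retrieval metric and one response score per example.
metric_values = [0.2, 0.4, 0.4, 0.6, 0.8, 1.0]
response_scores = [1, 2, 3, 3, 4, 5]
rho = spearman(metric_values, response_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Averaging tied ranks matters here because a 1-to-5 score produces many ties; a library routine such as SciPy's `spearmanr` could be substituted where that dependency is available.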

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a simple retrieval metric that does not require the total number of relevant documents, but its validation skips checking against actual recall on the very datasets where recall is computable.

The core point is that recall often cannot be calculated in large or changing knowledge bases, so the authors test ways around it by correlating retrieval metrics with LLM judgments of response quality from the retrieved documents. They also propose their own straightforward measure that works without knowing the full relevant set. This addresses a practical gap in retrieval evaluation for RAG-style systems, and running the correlations across multiple datasets with small relevant counts (2-15) gives the work some empirical grounding. The new measure is presented as performing well in those correlations, which is the main novelty here.

That said, the datasets chosen make recall computable in principle, yet the paper does not report how the new measure stacks up against ground-truth recall values on them. Everything instead hinges on the LLM judgments serving as a reliable proxy for retrieval quality. If those judgments are influenced by generation style, parametric knowledge in the model, or other factors unrelated to what was retrieved, the correlations do not cleanly establish that the metric tracks retrieval performance when totals are unknown. The abstract gives no details on the exact formula or statistical controls, which leaves the strength of the results hard to judge from the summary alone.

This work is aimed at people building or evaluating retrieval components in real-world applications where complete relevance sets are unavailable. A reader focused on IR metrics or RAG evaluation would find the discussion and the proposed measure worth considering.

It should go to peer review because it identifies a genuine limitation in standard practice and offers an initial alternative with some data behind it, even though the evidence would benefit from a direct comparison to recall where possible.

Referee Report

2 major / 2 minor

Summary. The paper claims that recall is often uncomputable in realistic retrieval settings because the total number of relevant documents is unknown. It evaluates established strategies for handling this by measuring correlations between retrieval quality metrics and LLM-based judgments of response quality (generated from the retrieved documents). Experiments are conducted across multiple datasets with 2-15 relevant documents, and a new simple retrieval quality measure is introduced that performs well without requiring knowledge of the total number of relevant documents.

Significance. If the central claim holds after addressing validation gaps, the work could provide a practical alternative for retrieval evaluation in large or evolving knowledge bases where complete relevance sets are infeasible. The multi-dataset experiments add breadth, and the focus on proxy-based evaluation addresses a real operational constraint in IR systems.

major comments (2)

[Experiments] Experimental description (abstract and methods): Although the datasets contain only 2-15 relevant documents—making ground-truth recall directly computable—the paper reports no direct comparison or correlation of the proposed measure (or baselines) against actual recall values. This is load-bearing for the claim that the measure 'performs well,' because the entire validation rests on unverified LLM judgments as a proxy.
[Results] Validation approach (results and discussion): The performance of the new measure is established solely via correlation with LLM judgments of response quality. No analysis or controls are described to address potential confounds, such as the LLM drawing on parametric knowledge rather than the retrieved documents, which risks the correlations reflecting factors orthogonal to retrieval quality.
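To make the first major comment concrete: when the number of positives is known, precision and recall at K follow directly from the relevance flags of the top-K items. A minimal Python sketch, assuming a 0/1 flag list in rank order (similar in spirit to the `inK` field described in the dataset appendix); the function name and the example values are illustrative and not taken from the paper.

```python
def precision_recall_at_k(in_k, n_positives):
    """Compute (precision, recall) for the top-K retrieved items.

    in_k        -- 0/1 relevance flags of the K retrieved items, in rank order
    n_positives -- total number of relevant documents for the query (known here)
    """
    k = len(in_k)
    hits = sum(in_k)
    if k == 0 or n_positives <= 0:
        raise ValueError("need at least one retrieved item and one positive")
    if hits > n_positives:
        raise ValueError("more hits than known positives")
    return hits / k, hits / n_positives


# Hypothetical example: 5 items retrieved, 2 of them relevant, 4 positives in total.
precision, recall = precision_recall_at_k([1, 0, 1, 0, 0], n_positives=4)
print(precision, recall)
```

Only the recall value depends on `n_positives`, which is exactly the quantity that is unavailable in the realistic settings the paper targets.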

minor comments (2)

[Abstract] The abstract gives no concrete definition or formula for the new retrieval quality measure, which should be stated explicitly (with any parameters) in the methods section for reproducibility.
[Results] No mention of statistical significance testing or confidence intervals for the reported correlations, which would clarify whether observed differences are reliable.
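One common way to attach an interval to a rank correlation, as the second minor comment asks, is a percentile bootstrap over examples. The sketch below is an illustration under stated assumptions, not the authors' procedure: it resamples (metric, score) pairs with replacement, recomputes Spearman's correlation each time, and reports the 2.5th and 97.5th percentiles. All names and data values are hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; ties receive the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out


def spearman(x, y):
    """Spearman correlation, or None when either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return None if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman's correlation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with a constant column
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Hypothetical per-example metric values and 1-to-5 response scores.
metric = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
score = [1, 2, 1, 2, 3, 3, 4, 3, 4, 5, 4, 5]
lo, hi = bootstrap_ci(metric, score)
print(f"95% bootstrap interval: [{lo:.3f}, {hi:.3f}]")
```

When two metrics are compared on the same examples, resampling the examples once per iteration and recomputing both correlations gives an interval for their difference.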

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental validation.

Referee: [Experiments] Experimental description (abstract and methods): Although the datasets contain only 2-15 relevant documents—making ground-truth recall directly computable—the paper reports no direct comparison or correlation of the proposed measure (or baselines) against actual recall values. This is load-bearing for the claim that the measure 'performs well,' because the entire validation rests on unverified LLM judgments as a proxy.

Authors: We agree that, given the small number of relevant documents in these datasets, a direct comparison to ground-truth recall is feasible and would strengthen the validation. Although the paper's primary motivation concerns realistic settings where the total number of relevant documents is unknown, we will add correlations between the proposed measure (and baselines) and actual recall values in the revised results section to address this point. revision: yes
Referee: [Results] Validation approach (results and discussion): The performance of the new measure is established solely via correlation with LLM judgments of response quality. No analysis or controls are described to address potential confounds, such as the LLM drawing on parametric knowledge rather than the retrieved documents, which risks the correlations reflecting factors orthogonal to retrieval quality.

Authors: We acknowledge the risk of confounds from parametric knowledge in the LLM judgments. We will add controls in the revised manuscript, including explicit prompts instructing the model to rely solely on the retrieved documents for judgments and, where feasible, comparisons using retrieval-only evaluation setups to better isolate retrieval quality effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of new retrieval measure

The paper introduces a simple retrieval quality measure and evaluates established strategies plus the new measure solely by their correlation with LLM-based judgments of response quality generated from retrieved documents. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain; the measure is presented as independent and the validation proxy (LLM judgments) is external to the measure's definition. On datasets with 2-15 relevant documents, recall remains computable but is deliberately not used for the new measure, keeping the reported performance independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to identify specific free parameters, axioms, or invented entities used by the new measure or evaluation setup.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1] Intelligent topic selection for low-cost information retrieval evaluation: A new perspective on deep vs. shallow judging. Information Processing & Management, 54(1):37–59. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanov...

[2] Active testing: An efficient and robust framework for estimating accuracy. arXiv. Filip Radlinski and Nick Craswell. 2010. Comparing the sensitivity of information retrieval metrics. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, page 667–674, New York, NY, USA. Associatio...

[3] LLMs can patch up missing relevance judgments in evaluation. arXiv, arXiv:2405.04727. Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2020. Assessing ranking metrics in top-N recommendation. Inf. Retr., 23(4):411–448. C.J. Van Rijsbergen. 1979. Information Retrieval, second edition. Butterworths, London. Liang Wang, Nan Yang, Xia...

[4] Graded samples. Footnote 8: https://huggingface.co/datasets/primer-ai/retrieval-response. Figure 19: Distribution of the response score (1 to 5) for embedding models shown on Y-axis. On Natural Questions; using only segments with minimum 300 samples, the ratio K/Np is rounded to the first digit. A.1.1 Query-texts samples. Query-texts samples consist of 6112 samples, e...

[5] “id”: A string Id of the sample. The Id consists of a name of a subset, concatenated by “-” with Id of the item in the subset. For example, Id=“N-5” means that it is sample #5 from the subset Natural Questions. Each sample is uniquely identified by its Id.

[6] “p”: The list of positives.

[7] “n”: The list of negatives. All the subsets in the dataset: “A”, “Hp-e”, “Hp-h”, “Hp-m”, “Hs-e”, “Hs-h”, “Hs-m”, “M”, “N”. The short names, as explained in Section 2, are “A” for ARXIV, “H” for HotpotQA, “M” for MSMARCO and “N” for Natural Questions. HotpotQA appears with two different granularities: The positives and negatives are (1) paragraphs in “H...

[8] “c1”: The symbolic name of a category for positives. (The name of this category serves as the query and is stored with the key “q”.)

[9] “c2”: The symbolic name of a category for negatives.

[10] “q2”: The name of the category for negatives (not used). For example, a sample with id=“A-0” has c1=“math.ca”, c2=“math.pr”, q=“classical analysis and ODEs” and q2=“probability” (it also has a list of positives “p” and a list of negatives “n”). A.1.2 Ranked samples. Each of the query-texts samples (Appendix A.1.1) can be used as a retrieval example with di...

[12] “E”: The embedding’s short notation, as specified in Section 2. The embedding used for ranking of all the candidates and selecting top-K candidates.

[13] “Nc”: Total number of candidates (positives and negatives), taken from the corresponding query-texts sample (with the same “id”).

[14] “Np”: Total number of positives, taken as the first Np positives “p” of the corresponding query-texts sample. (Negatives are also taken as the first Nc-Np from the negatives “n”.)

[15] “K”: A sorted list of all the K (number of retrieved candidates, “top-K”) used for this sample.

[16] “P”: A list of precisions calculated for the top-K specified in the list “K”, in the same order. Has the same length as the list “K”.

[17] “R”: A list of recalls calculated for the top-K specified in the list “K”, in the same order.

[18] “rank”: A list (length Nc) of indexes of all the candidates, sorted by rank according to cosine similarity with the query, under the embedding “E”. Each ranked sample is uniquely identified by the tuple (id, E, Nc, Np). In order to get the ranked texts corresponding to the “rank” list of a ranked sample Sr, its query-texts sample Sq (the sample with the ...

[19] “id”: A string Id of the sample, the same as in the query-texts samples.

[20] “E”: The embedding’s short notation.

[21] “Nc”: Total number of candidates (positives and negatives).

[22] “Np”: Total number of positives.

[23] “K”: A value of K (“top-K”) taken from the list of “K” in the corresponding ranked sample.

[24] “rank”: A list equal to the first K elements of the list “rank” of the corresponding ranked sample Sr, i.e. Sr[“rank”][:K].

[25] “inK”: A list created from the “rank” (the item above), by replacement of each index by 1 (if positive) or 0 (if negative).

[26] “answer_ideal”: LLM-generated answer to the query, obtained by using all the positives from the corresponding query-texts sample.

[27] “answer_topK”: LLM-generated answer to the query, obtained by using the retrieved K candidates, given to the LLM in their ranking order.

[28] “grade”: The LLM-generated score (on a Likert scale from 1 to 5), obtained by comparing the top-K answer to the ideal answer, with the knowledge of the query.

[29] “P”: A value of precision corresponding to the selected K; given here for convenience.

[30] “R”: A value of recall corresponding to the selected K; given here for convenience. Each graded sample is uniquely identified by the tuple (id, E, Nc, Np, K) and it is related to its ranked sample by the tuple (id, E, Nc, Np). To assure an understanding of the data of a graded sample Sg and of the corresponding ranked sample Sr, see a few assertions in Fi...
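The field descriptions above relate three kinds of sample: a query-texts sample Sq, a ranked sample Sr, and a graded sample Sg. The Python sketch below builds toy instances and checks the five assertions whose text appears alongside the Figure 29 caption. It rests on one loud assumption that the extracted text does not confirm: that the candidate list is the first Np positives followed by the first Nc-Np negatives, so that an index below Np denotes a positive. The helper names and all data values are hypothetical, and the `answer_ideal`, `answer_topK` and `grade` fields are omitted.

```python
# Toy query-texts sample (values are invented for illustration).
Sq = {
    "id": "N-5",
    "q": "example query",
    "p": ["pos0", "pos1", "pos2"],
    "n": ["neg0", "neg1", "neg2", "neg3"],
}

# Toy ranked sample: 5 candidates, 2 of them positives, ranked by some embedding.
Sr = {"id": "N-5", "E": "emb", "Nc": 5, "Np": 2, "K": [1, 3, 5], "rank": [1, 3, 0, 2, 4]}


def candidates(sq, nc, np_):
    """ASSUMPTION: candidates are the first Np positives, then the first Nc-Np negatives."""
    return sq["p"][:np_] + sq["n"][: nc - np_]


def ranked_texts(sq, sr):
    """Map the index list sr['rank'] back to candidate texts, in rank order."""
    cands = candidates(sq, sr["Nc"], sr["Np"])
    return [cands[i] for i in sr["rank"]]


def graded_from_ranked(sr, k):
    """Derive the retrieval-related fields of a graded sample for one value of K."""
    top = sr["rank"][:k]
    # Relies on the same ordering assumption: index < Np means "positive".
    in_k = [1 if i < sr["Np"] else 0 for i in top]
    return {
        "id": sr["id"], "E": sr["E"], "Nc": sr["Nc"], "Np": sr["Np"],
        "K": k, "rank": top, "inK": in_k,
        "P": sum(in_k) / k, "R": sum(in_k) / sr["Np"],
    }


Sg = graded_from_ranked(Sr, 3)

# The assertions extracted alongside the Figure 29 caption.
assert len(Sg["rank"]) == Sg["K"]
assert len(Sg["inK"]) == Sg["K"]
assert sum(Sg["inK"]) <= Sg["Np"]
assert Sg["K"] in Sr["K"]
assert Sg["rank"] == Sr["rank"][: Sg["K"]]

print(ranked_texts(Sq, Sr))
print(Sg["inK"], Sg["P"], Sg["R"])
```

If the released dataset orders candidates differently, only `candidates` and the `inK` rule need to change; the five assertions are independent of that choice.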
