An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

Hengran Zhang; Jiafeng Guo; Keping Bi; Xueqi Cheng

arXiv: 2406.11290 · v3 · submitted 2024-06-17 · 💻 cs.IR · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-24 00:12 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL · cs.LG

keywords retrieval-augmented generation · utility judgment · philosophical relevance · iterative framework · large language models · information retrieval · question answering · relevance ranking

The pith

An iterative LLM framework for utility judgment improves RAG ranking and answer generation by aligning components with three philosophical relevance types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an Iterative Utility Judgment Framework (ITEM) that treats the three steps of retrieval-augmented generation as corresponding to three types of relevance from philosophy. These types represent increasing cognitive levels that reinforce one another, so the framework applies LLMs in successive rounds to judge how useful each passage is. The result is better utility scores, stronger ranking of results, and higher-quality answers on standard datasets. A sympathetic reader would care because RAG systems must fit useful content inside limited input windows, and better selection at each step could raise overall effectiveness without extra training.

Core claim

RAG's relevance ranking, utility judgment, and answer generation align with three types of relevance that stand for different cognitive levels and enhance each other. The ITEM framework therefore uses iterative LLM-based utility scoring to promote every step in the pipeline. Experiments on retrieval collections, a utility judgment task, and factoid QA show gains over representative baselines in all three areas.

What carries the argument

ITEM, the iterative utility judgment procedure that runs successive LLM scoring rounds guided by the mapping of RAG components to philosophical relevance types.
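
A minimal sketch of the alternating loop described above, written under stated assumptions rather than taken from the authors' code. The `llm` callable, the prompt wording, the `max_iters` parameter, and the rule of stopping when the selection repeats are all hypothetical; only the "My selection:[1, 3, 13]" output shape and the listwise instruction sentence come from the paper's figures.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical LLM interface: takes a prompt string, returns the model's text.
LLM = Callable[[str], str]


def parse_selection(text: str) -> List[int]:
    """Extract passage ids from output shaped like 'My selection:[1, 3, 13]'."""
    match = re.search(r"\[([\d,\s]*)\]", text)
    if not match:
        return []
    return [int(n) for n in re.findall(r"\d+", match.group(1))]


def item_loop(question: str, passages: List[str], llm: LLM,
              max_iters: int = 3) -> Tuple[str, List[int]]:
    """Alternate pseudo-answer generation and listwise utility judgment.

    Assumption of this sketch: iteration stops early when the selected
    passage ids stop changing or the model returns no usable selection.
    """
    selected = list(range(1, len(passages) + 1))  # start from every passage
    answer = ""
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    for _ in range(max_iters):
        # Step 1: generate a pseudo answer from the currently selected passages.
        context = "\n".join(f"[{i}] {passages[i - 1]}" for i in selected)
        answer = llm(f"Question: {question}\nPassages:\n{context}\nAnswer:")
        # Step 2: judge utility of all passages against that pseudo answer.
        judged = parse_selection(llm(
            f"Question: {question}\nReference answer: {answer}\n"
            f"Passages:\n{listing}\n"
            "Directly output the passages you selected that have utility "
            "in generating the reference answer to the question."
        ))
        judged = [i for i in judged if 1 <= i <= len(passages)]
        if not judged or judged == selected:
            break
        selected = judged
    return answer, selected
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, so the loop can be exercised with a scripted stub before a real model is attached.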

If this is right

More accurate utility judgments raise the quality of ranked results fed to the generator.
Higher-utility passages improve the factual correctness of generated answers on factoid questions.
The iterative process benefits ranking, judgment, and generation together rather than in isolation.
The same alignment applies across retrieval, utility, and QA benchmarks without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Philosophical relevance concepts could shape prompting strategies for other LLM content-evaluation tasks.
The number of iterations might be tuned per query type to balance gains against added compute.
Similar mappings could be tested in non-RAG retrieval settings where usefulness matters more than topical match.

Load-bearing premise

The three RAG components map onto the three philosophical relevance types in a way that makes iterative LLM utility scoring produce genuine gains rather than prompt or dataset artifacts.

What would settle it

Running ITEM on the NQ dataset and finding no improvement in answer accuracy or utility judgment quality compared with non-iterative baselines would falsify the central claim.
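
To make "utility judgment quality" concrete, the following sketch computes set-based precision, recall, and F1 for one query. It assumes gold utility labels are available as a set of passage ids; the function name and the per-query formulation are illustrative, and how the paper aggregates these scores across queries is not stated on this page.

```python
from typing import Iterable, Tuple


def utility_prf(selected: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of selected passage ids against gold ids."""
    sel, ref = set(selected), set(gold)
    hits = len(sel & ref)
    precision = hits / len(sel) if sel else 0.0
    recall = hits / len(ref) if ref else 0.0
    # F1 is the harmonic mean; it is zero whenever there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For example, selecting passages 1, 3, and 13 when only passage 13 is useful gives precision 1/3, recall 1, and F1 0.5, which shows why a tighter selection in a later round can raise F1 without changing recall.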

Figures

Figures reproduced from arXiv: 2406.11290 by Hengran Zhang, Jiafeng Guo, Keping Bi, Xueqi Cheng.

**Figure 2.** Ia instruction contains the implicit answer and explicit answer. Utility judgments instruction. Listwise: "Directly output the passages you selected that have utility in generating the reference answer to the question." Pointwise: "Directly output whether the passage has utility in generating the reference answer to the question or not."

**Figure 3.** Iu instruction contains listwise and pointwise approaches. … i.e., a, U = f(q, D, I). 3.3 Iterative utiliTy judgmEnt fraMework (ITEM). Schutz (Schutz, 1970) emphasized the existence of various types of relevance and underscored the interactivity and interdependence between these various types from a much broader arena than information science. Inspired by the powerful insight, we propose an Iterative …

**Figure 4.** The flowchart shows the first iteration of the …

**Figure 5.** Ir instruction. … Schutz’s theory, topical relevance, motivational relevance, and interpretive relevance are all dynamically affected. In the ITEM-A framework, the topical relevance is not updated during the iteration process. Consequently, we have incorporated a relevance ranking task into the ITEM framework, which ensures that all three tasks are executed in a loop. Formally, at iteration t (t ≥ 1), the a…

**Figure 6.** Utility judgments performance (%) of differ…

**Figure 7.** Topical relevance performance (%) of Mistral …

**Figure 9.** Instruction in the listwise approach. … length of the explicit answers generated by Mistral using “sentences” is too long for factual questions, whereas Llama 3 and ChatGPT use “sentences” to generate answers of moderate length. So we only use “words” on the TREC dataset and the NQ dataset using Mistral. The two instructions are shown in …

**Figure 10.** Instruction in the pointwise approach.

| m | NDCG@5 | NDCG@10 | NDCG@20 | Utility-judgments F1 |
|---|---|---|---|---|
| 1 | 71.29 / 70.57 | 72.90 / 72.69 | 84.56 / 84.08 | 52.34 / 42.02 |
| 2 | 72.27 / 72.86 | 75.16 / 75.48 | 85.76 / 86.09 | 53.40 / 42.10 |
| 3 | 73.24 / 74.27 | 75.53 / 75.78 | 86.59 / 86.80 | 56.27 / 44.18 |
| 4 | 73.58 / 75.35 | 75.52 / 76.83 | 86.46 / 87.23 | 56.07 / 44.68 |
| 5 | 73.12 / 74.61 | 74.75 / 76.20 | 85.95 / 86.82 | 57.82 / 44.25 |

**Figure 11.** Instruction of the relevance ranking approach in our ITEM.

**Figure 12.** Instruction of the utility ranking approach in our ITEM.

**Figure 13.** Instruction of the ranking approach in Sun et al. (2023).

Columns N@1 through N@20 report ranking; P, R, and F1 report utility judgments.

| k, m | N@1 | N@3 | N@5 | N@10 | N@20 | P | R | F1 |
|---|---|---|---|---|---|---|---|---|
| k=1, m=1 | 72.76 | 71.27 | 70.57 | 72.69 | 84.08 | 53.66 | 24.09 | 33.25 |
| k=1, m=2 | 76.02 | 71.54 | 71.38 | 73.66 | 84.78 | 58.54 | 28.73 | 38.54 |
| k=1, m=3 | 77.24 | 72.83 | 71.83 | 73.87 | 85.20 | 59.76 | 28.84 | 38.90 |
| k=1, m=4 | 77.24 | 73.04 | 71.91 | 73.90 | 85.25 | 59.76 | 28.84 | 38.90 |
| k=1, m=5 | 76.02 | 72.11 | 71.42 | 73.45 | 84.98 | 58.54 | 28.71 | 38.53 |

k=5, …

**Figure 14.** Instruction of the explicit answer generation.

**Figure 15.** Instruction of the implicit answer generation.

**Figure 16.** An example of our ITEM-As using Mistral on the TREC dataset. Question: when did family feud come out? First pseudo answer: Family Feud has been on air since 1976. First utility judgment: My selection:[1, 3, 13]. Second pseudo answer: The original Family Feud debuted in 1976. Second utility judgment: My selection:[13]. Third pseudo answer: The Family Feud debuted in 1976. Third utility judgment: My selecti…

**Figure 17.** An example of our ITEM-As using Mistral on the TREC dataset.
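
Several of the captions above report NDCG@k (abbreviated N@k). As a reading aid, here is a minimal sketch of one common NDCG variant, using linear gains and a log2 rank discount. Which gain function the paper's evaluation uses is not stated on this page, and the sketch also assumes that every judged passage appears in the ranked list, so the ideal ordering can be obtained by sorting the same gains.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (rank 1 has discount 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the best possible ordering of the same gains."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the single relevant passage second out of two scores 1/log2(3), about 0.63, which illustrates how much a one-position promotion can move N@k at small cutoffs.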

read the original abstract

Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In retrieval-augmented generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components (relevance ranking derived from retrieval models, utility judgments, and answer generation) aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ITEM wraps iterative RAG prompting in a Schutz-inspired mapping but the abstract supplies no numbers, so the actual gains stay unproven.

read the letter

The main takeaway is that this paper proposes ITEM, an iterative utility judgment loop for RAG that maps relevance ranking, utility scoring, and answer generation onto Schutz's three cognitive levels of relevance. They test it on TREC DL and WebAP for ranking, GTI-NQ for utility judgments, and NQ for QA, claiming improvements over baselines in each area. The explicit three-way philosophical framing is the clearest new element; most prior RAG work already uses multi-step LLM calls for re-ranking and filtering, so the contribution sits in the structured alignment rather than the iteration itself. The multi-dataset coverage is a reasonable choice and shows they tried to check the components separately rather than only end-to-end QA scores. The soft spot is the missing evidence. The abstract states improvements without any numbers, effect sizes, statistical tests, or ablation results, which leaves open whether the Schutz distinctions are doing real work or whether any extra round of prompting would produce similar lifts. The stress-test note is accurate on this point: if the mapping is inspirational rather than load-bearing, the paper reduces to another prompt-engineering variant. No circularity or invented quantities appear in the description, and the citations track standard RAG literature. This is for people already tuning RAG pipelines who want a conceptual scaffold for utility-focused iteration; it will not shift the broader field. The work is coherent enough on its own terms to merit referee time, so I would send it out, with the clear expectation that reviewers will require detailed results and controls before any stronger claims can be accepted.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an Iterative utiliTy judgmEnt fraMework (ITEM) that aligns RAG's three components (relevance ranking from retrieval models, utility judgments, and answer generation) with Schutz's three types of relevance representing distinct cognitive levels. It claims this alignment enables an iterative LLM procedure that improves utility judgments, ranking, and answer generation, with experiments on TREC DL, WebAP, GTI-NQ, and NQ datasets showing gains over representative baselines.

Significance. If the central claim holds and the Schutz-derived distinctions are shown to be necessary rather than incidental to iterative prompting, the work could provide an interdisciplinary lens for structuring LLM interactions in IR. The attempt to map cognitive levels explicitly is a conceptual strength, but the abstract supplies no quantitative metrics, baselines, ablations, or statistical tests, limiting assessment of whether the result would advance the field beyond existing multi-step LLM techniques.

major comments (2)

[Abstract] Abstract: the assertion that 'Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines' supplies no numerical values, baseline identifiers, effect sizes, or statistical tests. This absence is load-bearing for the experimental claim and prevents evaluation of whether the reported gains support the framework.
[Methods] Methods/Experimental design: the central claim requires that the three-way alignment with Schutz's relevances produces genuine iterative improvements beyond standard multi-step LLM prompting. No ablation is described that retains iteration while removing the philosophical distinctions (e.g., a control using generic iterative scoring), leaving open whether the mapping is load-bearing or inspirational only.
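
One way the statistical test requested above could be run is a paired sign-flip randomization test on per-query scores, for example ITEM against a generic iterative control. The sketch below is illustrative only: the function name, the trial count, and the idea of pairing per-query NDCG or F1 values are assumptions, not a procedure described in the paper.

```python
import random
from typing import Sequence


def paired_randomization_test(a: Sequence[float], b: Sequence[float],
                              trials: int = 10000, seed: int = 0) -> float:
    """Two-sided p-value for a difference in means on paired per-query scores."""
    if len(a) != len(b):
        raise ValueError("score lists must be paired per query")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis each paired difference may flip sign freely.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (trials + 1)
```

Running the same test on ITEM versus a control that keeps the loop but drops the three-way mapping would address both major comments with one set of per-query scores.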

minor comments (2)

The stylized capitalization in the title and acronym definition (utiliTy judgmEnt fraMework) is unconventional and may reduce readability; consider standard title-case formatting.
No statement on code, data, or prompt availability is provided, which would aid reproducibility of the LLM-based procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that 'Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines' supplies no numerical values, baseline identifiers, effect sizes, or statistical tests. This absence is load-bearing for the experimental claim and prevents evaluation of whether the reported gains support the framework.

Authors: We agree that the abstract would benefit from including specific quantitative results to substantiate the claims. In the revised manuscript, we will update the abstract to report key performance improvements (including effect sizes and baseline identifiers) from the experiments on TREC DL, WebAP, GTI-NQ, and NQ. revision: yes

Referee: [Methods] Methods/Experimental design: the central claim requires that the three-way alignment with Schutz's relevances produces genuine iterative improvements beyond standard multi-step LLM prompting. No ablation is described that retains iteration while removing the philosophical distinctions (e.g., a control using generic iterative scoring), leaving open whether the mapping is load-bearing or inspirational only.

Authors: We acknowledge that an explicit ablation isolating the Schutz-derived distinctions from generic iterative prompting would strengthen the central claim. We will add this ablation study to the revised manuscript, comparing the full ITEM framework against a control that preserves the iterative loop but removes the specific philosophical mappings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is conceptually motivated without reduction to self-defined inputs

full rationale

The paper motivates ITEM by noting an alignment between RAG's three components and Schutz's three types of relevance, then proposes the iterative framework on that basis. No equations, fitted parameters, or predictions appear in the provided text, so no step reduces a claimed result to a quantity defined by the authors' own choices or prior self-citations. The mapping functions as inspirational analogy rather than a load-bearing derivation that forces outcomes by construction, and experiments are reported on independent external datasets (TREC DL, WebAP, GTI-NQ, NQ) without statistical forcing from internal fits. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested mapping between Schutz's philosophical relevances and RAG stages plus the assumption that LLMs can reliably perform the utility judgment step when prompted iteratively; no free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: Schutz's three types of relevance correspond to the three core components of RAG and to three cognitive levels for LLMs.
Explicitly invoked in the abstract as the justification for the iterative framework.

pith-pipeline@v0.9.0 · 5741 in / 1407 out tokens · 22652 ms · 2026-05-24T00:12:56.186165+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 2 internal anchors

[1]

Mistral 7B

Performance prediction for non-factoid question answering. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 55–58. Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot ...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

In 2016 Eleventh International Conference on Digital Information Management (ICDIM), pages 121–126

Towards automatic generation of relevance judgments for a test collection. In 2016 Eleventh International Conference on Digital Information Management (ICDIM), pages 121–126. IEEE. Meta. 2024. Welcome llama 3 - meta’s new open llm. Donald Metzler and W Bruce Croft. 2005. A markov random field model for term dependencies. In Proceedings of the 28th a...

work page 2016
[3]

MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings. CEUR...

work page 2016
[4]

Passage Re-ranking with BERT

Passage re-ranking with BERT. CoRR, abs/1901.04085. OpenAI. 2022. Introducing chatgpt. OpenAI. 2023. Gpt-4 technical report. Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. Rankvicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088. Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuan...

work page internal anchor Pith review Pith/arXiv arXiv 1901
[5]

arXiv preprint arXiv:2312.06585

Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585. Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab ...

work page arXiv 2022
[6]

The reference answer may not be the correct answer, but it provides a pattern of the correct answer

Beyond yes and no: Improving zero-shot llm rankers via scoring fine-grained relevance labels. 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics. A Instruction Details A.1 Instruction of Listwise and Pointwise Approaches For the prompts of the NQ dataset using ChatGPT, we follow the setting of Zhang e...

work page 2024
[8]

sentences

{{passage_2}} assistant : Received passage [2] (more passages) ... user: Question: {query}. Reference answer: {answer}. The requirements for judging whether a passage has utility in answering the question are: The passage has utility in answering the question, meaning that the passage not only be relevant to the question, but also be useful in generating ...

work page 2022
[10]

user: Query: {query}

{{passage_2}} assistant : Received passage [2] (more passages) ... user: Query: {query}. Reference answer: {answer} Rank the {num} passages above based on their relevance to the query. The passages should be listed in descending order using identifiers. The most relevant passages should be listed first. The output format should be [] > [] > [] > ..., e.g....

work page
[12]

user: Question: {query}

{{passage_2}} assistant : Received passage [2] (more passages) ... user: Question: {query}. Reference answer: {answer} Rank the {num} passages above based on their utility in generating the reference answer to the question. The passages should be listed in utility descending order using identifiers. The passages that have utility in generating the referen...

work page
[13]

{{passage_1}} assistant : Received passage [1] user:

work page
[14]

N@k” means “NDCG@k

{{passage_2}} assistant : Received passage [2] (more passages) ... user: Query: {query}. Rank the {num} passages above based on their relevance to the query. The passages should be listed in descending order using identifiers. The most relevant passages should be listed first. The output format should be [] > [] > [] > ..., e.g., [i] > [ j] > [k] > ... On...

work page 2023
