An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs
Pith reviewed 2026-05-24 00:12 UTC · model grok-4.3
The pith
An iterative LLM framework for utility judgment improves RAG ranking and answer generation by aligning components with three philosophical relevance types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that RAG's relevance ranking, utility judgment, and answer generation align with three types of relevance that represent different cognitive levels and reinforce one another. The ITEM framework therefore uses iterative LLM-based utility scoring to improve every step in the pipeline. Experiments on retrieval collections (TREC DL, WebAP), a utility judgment task (GTI-NQ), and factoid QA (NQ) show gains over representative baselines in all three areas.
What carries the argument
ITEM, the iterative utility judgment procedure that runs successive LLM scoring rounds guided by the mapping of RAG components to philosophical relevance types.
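The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `judge_utility` and `generate_answer`, the `max_rounds` cap, and the stop-when-stable rule are all assumptions made for the example. The text here only states that ITEM runs successive LLM scoring rounds. In practice the two callables would wrap LLM prompts.

```python
from typing import Callable, List, Set, Tuple


def iterative_utility_judgment(
    question: str,
    passages: List[str],
    judge_utility: Callable[[str, str, List[str]], Set[int]],  # hypothetical LLM wrapper
    generate_answer: Callable[[str, List[str]], str],          # hypothetical LLM wrapper
    max_rounds: int = 3,                                        # assumed iteration cap
) -> Tuple[str, Set[int]]:
    """Alternate answer generation and utility judgment.

    Each round generates an answer from the currently selected passages,
    then re-judges every passage's utility with that answer as a reference.
    The loop stops when the selected set no longer changes (an assumed
    stopping rule) or when max_rounds is reached.
    """
    # Round 0: start from the full retrieved list.
    selected: Set[int] = set(range(len(passages)))
    answer = ""
    for _ in range(max_rounds):
        context = [passages[i] for i in sorted(selected)]
        answer = generate_answer(question, context)
        new_selected = judge_utility(question, answer, passages)
        if new_selected == selected:
            break  # judgments are stable, so further rounds add nothing
        selected = new_selected
    return answer, selected
```

The function returns the last generated answer together with the indices of the passages judged useful, so the same output can feed ranking, judgment, and generation evaluations.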
If this is right
- More accurate utility judgments raise the quality of ranked results fed to the generator.
- Higher-utility passages improve the factual correctness of generated answers on factoid questions.
- The iterative process benefits ranking, judgment, and generation together rather than in isolation.
- The same alignment applies across retrieval, utility, and QA benchmarks without task-specific retraining.
Where Pith is reading between the lines
- Philosophical relevance concepts could shape prompting strategies for other LLM content-evaluation tasks.
- The number of iterations might be tuned per query type to balance gains against added compute.
- Similar mappings could be tested in non-RAG retrieval settings where usefulness matters more than topical match.
Load-bearing premise
The three RAG components map onto the three philosophical relevance types in a way that makes iterative LLM utility scoring produce genuine gains rather than prompt or dataset artifacts.
What would settle it
Running ITEM on the NQ dataset and finding no improvement in answer accuracy or utility judgment quality compared with non-iterative baselines would falsify the central claim.
Original abstract
Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In retrieval-augmented generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components (relevance ranking derived from retrieval models, utility judgments, and answer generation) aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an Iterative utiliTy judgmEnt fraMework (ITEM) that aligns RAG's three components (relevance ranking from retrieval models, utility judgments, and answer generation) with Schutz's three types of relevance representing distinct cognitive levels. It claims this alignment enables an iterative LLM procedure that improves utility judgments, ranking, and answer generation, with experiments on TREC DL, WebAP, GTI-NQ, and NQ datasets showing gains over representative baselines.
Significance. If the central claim holds and the Schutz-derived distinctions are shown to be necessary rather than incidental to iterative prompting, the work could provide an interdisciplinary lens for structuring LLM interactions in IR. The attempt to map cognitive levels explicitly is a conceptual strength, but the abstract supplies no quantitative metrics, baselines, ablations, or statistical tests, limiting assessment of whether the result would advance the field beyond existing multi-step LLM techniques.
major comments (2)
- [Abstract] Abstract: the assertion that 'Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines' supplies no numerical values, baseline identifiers, effect sizes, or statistical tests. This absence is load-bearing for the experimental claim and prevents evaluation of whether the reported gains support the framework.
- [Methods] Methods/Experimental design: the central claim requires that the three-way alignment with Schutz's relevances produces genuine iterative improvements beyond standard multi-step LLM prompting. No ablation is described that retains iteration while removing the philosophical distinctions (e.g., a control using generic iterative scoring), leaving open whether the mapping is load-bearing or inspirational only.
minor comments (2)
- The stylized capitalization in the title and acronym definition (utiliTy judgmEnt fraMework) is unconventional and may reduce readability; consider standard title-case formatting.
- No statement on code, data, or prompt availability is provided, which would aid reproducibility of the LLM-based procedure.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.
- Referee: [Abstract] Abstract: the assertion that 'Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines' supplies no numerical values, baseline identifiers, effect sizes, or statistical tests. This absence is load-bearing for the experimental claim and prevents evaluation of whether the reported gains support the framework.
  Authors: We agree that the abstract would benefit from including specific quantitative results to substantiate the claims. In the revised manuscript, we will update the abstract to report key performance improvements (including effect sizes and baseline identifiers) from the experiments on TREC DL, WebAP, GTI-NQ, and NQ. Revision: yes.
- Referee: [Methods] Methods/Experimental design: the central claim requires that the three-way alignment with Schutz's relevances produces genuine iterative improvements beyond standard multi-step LLM prompting. No ablation is described that retains iteration while removing the philosophical distinctions (e.g., a control using generic iterative scoring), leaving open whether the mapping is load-bearing or inspirational only.
  Authors: We acknowledge that an explicit ablation isolating the Schutz-derived distinctions from generic iterative prompting would strengthen the central claim. We will add this ablation study to the revised manuscript, comparing the full ITEM framework against a control that preserves the iterative loop but removes the specific philosophical mappings. Revision: yes.
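A comparison of this kind, the full framework against an iteration-only control, is typically reported with a significance test over per-query scores. The sketch below uses a paired sign-flip randomization test, a common choice in IR evaluation. It is not taken from the paper: the function name, the default trial count, and the seed are illustrative assumptions, and the caller supplies the per-query metric values.

```python
import random
from typing import Sequence


def paired_randomization_test(
    system_a: Sequence[float],
    system_b: Sequence[float],
    trials: int = 10_000,  # assumed number of random sign assignments
    seed: int = 0,         # fixed seed so the estimate is reproducible
) -> float:
    """Two-sided p-value for the mean per-query difference between two systems.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-query difference can be flipped at random.
    """
    if len(system_a) != len(system_b) or len(system_a) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(system_a, system_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

Passing the per-query scores of the full framework as `system_a` and those of the control as `system_b` yields a p-value for whether the philosophical mapping, rather than iteration alone, accounts for the difference.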
Circularity Check
No significant circularity; framework is conceptually motivated without reduction to self-defined inputs
The paper motivates ITEM by noting an alignment between RAG's three components and Schutz's three types of relevance, then proposes the iterative framework on that basis. No equations, fitted parameters, or predictions appear in the provided text, so no step reduces a claimed result to a quantity defined by the authors' own choices or prior self-citations. The mapping functions as inspirational analogy rather than a load-bearing derivation that forces outcomes by construction, and experiments are reported on independent external datasets (TREC DL, WebAP, GTI-NQ, NQ) without statistical forcing from internal fits. The derivation chain therefore remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Schutz's three types of relevance correspond to the three core components of RAG and to three cognitive levels for LLMs.