
arXiv: 2406.11290 · v3 · submitted 2024-06-17 · 💻 cs.IR · cs.AI · cs.CL · cs.LG

An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

Pith reviewed 2026-05-24 00:12 UTC · model grok-4.3

keywords retrieval-augmented generation · utility judgment · philosophical relevance · iterative framework · large language models · information retrieval · question answering · relevance ranking

The pith

An iterative LLM framework for utility judgment improves RAG ranking and answer generation by aligning components with three philosophical relevance types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an Iterative Utility Judgment Framework (ITEM) that treats the three steps of retrieval-augmented generation as corresponding to three types of relevance from philosophy. These types represent increasing cognitive levels that reinforce one another, so the framework applies LLMs in successive rounds to judge how useful each passage is. The result is better utility scores, stronger ranking of results, and higher-quality answers on standard datasets. A sympathetic reader would care because RAG systems must fit useful content inside limited input windows, and better selection at each step could raise overall effectiveness without extra training.

Core claim

RAG's relevance ranking, utility judgment, and answer generation align with three types of relevance that stand for different cognitive levels and enhance each other. The ITEM framework therefore uses iterative LLM-based utility scoring to promote every step in the pipeline. Experiments on retrieval collections, a utility judgment task, and factoid QA show gains over representative baselines in all three areas.

What carries the argument

ITEM, the iterative utility judgment procedure that runs successive LLM scoring rounds guided by the mapping of RAG components to philosophical relevance types.

If this is right

  • More accurate utility judgments raise the quality of ranked results fed to the generator.
  • Higher-utility passages improve the factual correctness of generated answers on factoid questions.
  • The iterative process benefits ranking, judgment, and generation together rather than in isolation.
  • The same alignment applies across retrieval, utility, and QA benchmarks without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Philosophical relevance concepts could shape prompting strategies for other LLM content-evaluation tasks.
  • The number of iterations might be tuned per query type to balance gains against added compute.
  • Similar mappings could be tested in non-RAG retrieval settings where usefulness matters more than topical match.

Load-bearing premise

The three RAG components map onto the three philosophical relevance types in a way that makes iterative LLM utility scoring produce genuine gains rather than prompt or dataset artifacts.

What would settle it

Running ITEM on the NQ dataset and finding no improvement in answer accuracy or utility judgment quality compared with non-iterative baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2406.11290 by Hengran Zhang, Jiafeng Guo, Keping Bi, Xueqi Cheng.

Figure 1. Schutz’s “system of relevancies” and the …
Figure 2. I_a instruction contains the implicit answer and explicit answer.
  Utility judgments instruction:
  • Listwise: Directly output the passages you selected that have utility in generating the reference answer to the question.
  • Pointwise: Directly output whether the passage has utility in generating the reference answer to the question or not.
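To make the difference between the two judgment modes concrete, the following is a minimal sketch of how the listwise and pointwise instructions quoted in this caption could be wrapped into prompts. Only the two instruction sentences come from the source; the function names, the passage numbering, and the overall prompt layout are assumptions for illustration and are not the paper's exact templates.

```python
# Instruction sentences as quoted in the figure caption.
LISTWISE = ("Directly output the passages you selected that have utility "
            "in generating the reference answer to the question.")
POINTWISE = ("Directly output whether the passage has utility in generating "
             "the reference answer to the question or not.")


def listwise_prompt(question, reference_answer, passages):
    """Build ONE prompt covering every candidate passage (hypothetical layout)."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (f"{numbered}\n"
            f"Question: {question}\n"
            f"Reference answer: {reference_answer}\n"
            f"{LISTWISE}")


def pointwise_prompts(question, reference_answer, passages):
    """Build one prompt PER passage (hypothetical layout)."""
    return [
        (f"Passage: {p}\n"
         f"Question: {question}\n"
         f"Reference answer: {reference_answer}\n"
         f"{POINTWISE}")
        for p in passages
    ]
```

The practical trade-off this shows: the listwise mode costs one model call but must fit all passages in a single context, while the pointwise mode costs one call per passage and judges each passage in isolation.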
Figure 3. I_u instruction contains listwise and pointwise approaches.
  Surrounding paper text: …ter, i.e., a, U = f(q, D, I). 3.3 Iterative utiliTy judgmEnt fraMework (ITEM). Schutz (Schutz, 1970) emphasized the existence of various types of relevance and underscored the interactivity and interdependence between these various types from a much broader arena than information science. Inspired by the powerful insight, we propose an Iterative …
Figure 4. The flowchart shows the first iteration of the …
Figure 5. I_r instruction
  Surrounding paper text: Schutz’s theory, topical relevance, motivational relevance, and interpretive relevance are all dynamically affected. In the ITEM-A framework, the topical relevance is not updated during the iteration process. Consequently, we have incorporated a relevance ranking task into the ITEM framework, which ensures that all three tasks are executed in a loop. Formally, at iteration t (t ≥ 1), the a…
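The loop described in this excerpt can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables stand in for LLM calls and are hypothetical, the order of steps within an iteration is assumed, the first pseudo answer is assumed to be generated from all candidate passages, and the stop-when-the-selection-is-unchanged rule and `max_iters` cap are assumptions. Passing `rerank=None` corresponds to the variant in which topical relevance is not updated; passing a re-ranker runs all three tasks in the loop.

```python
from typing import Callable, List, Optional, Tuple


def item_loop(
    question: str,
    passages: List[str],
    generate_answer: Callable[[str, List[str]], str],
    judge_utility: Callable[[str, str, List[str], List[int]], List[int]],
    rerank: Optional[Callable[[str, str, List[str], List[int]], List[int]]] = None,
    max_iters: int = 3,
) -> Tuple[str, List[int], List[int]]:
    """Alternate answer generation, optional re-ranking, and utility judgment.

    Returns (last pseudo answer, selected passage indices, final ranking).
    """
    order = list(range(len(passages)))  # current ranking, as passage indices
    selected = list(order)              # assumption: start from all passages
    answer = ""
    for _ in range(max_iters):
        # 1) Generate a pseudo answer from the currently selected passages.
        answer = generate_answer(question, [passages[i] for i in selected])
        # 2) Optionally refresh the relevance ranking using the pseudo answer.
        if rerank is not None:
            order = rerank(question, answer, passages, order)
        # 3) Judge utility of the ranked candidates, guided by the pseudo answer.
        new_selected = judge_utility(question, answer, passages, order)
        if new_selected == selected:    # assumed stopping rule
            break
        selected = new_selected
    return answer, selected, order
```

Keeping the three steps as injected callables makes it straightforward to swap in a real model client or to run a control in which one step is disabled.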
Figure 6. Utility judgments performance (%) of differ…
Figure 7. Topical relevance performance (%) of Mistral …
Figure 9. Instruction in the listwise approach.
  Surrounding paper text: … length of the explicit answers generated by Mistral using “sentences” is too long for factual questions, whereas Llama 3 and ChatGPT use “sentences” to generate answers of moderate length. So we only use “words” on the TREC dataset and the NQ dataset using Mistral. The two instructions are shown in …
Figure 10. Instruction in the pointwise approach.
  Table text captured with the caption:
  m   NDCG@5          NDCG@10         NDCG@20         Utility-judgments F1
  1   71.29 / 70.57   72.90 / 72.69   84.56 / 84.08   52.34 / 42.02
  2   72.27 / 72.86   75.16 / 75.48   85.76 / 86.09   53.40 / 42.10
  3   73.24 / 74.27   75.53 / 75.78   86.59 / 86.80   56.27 / 44.18
  4   73.58 / 75.35   75.52 / 76.83   86.46 / 87.23   56.07 / 44.68
  5   73.12 / 74.61   74.75 / 76.20   85.95 / 86.82   57.82 / 44.25
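For readers unfamiliar with the ranking metric in these columns, here is a minimal sketch of NDCG@k. It assumes linear gain (the graded relevance label itself) and computes the ideal ordering from the labels of the ranked list only; the paper's evaluation tooling may use a different gain function or take the ideal ranking over all judged passages, so this is illustrative rather than a reproduction of the reported numbers.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k items (ranks start at 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

`gains` is the list of relevance labels in the order the system ranked the passages. The values in the table are percentages, i.e. this ratio multiplied by 100.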
Figure 11. Instruction of the relevance ranking approach in our ITEM.
Figure 12. Instruction of the utility ranking approach in our ITEM.
Figure 13. Instruction of the ranking approach in Sun et al. (2023).
  Table text captured with the caption:
              Ranking                              Utility judgments
  k, m        N@1    N@3    N@5    N@10   N@20     P      R      F1
  k=1, m=1    72.76  71.27  70.57  72.69  84.08    53.66  24.09  33.25
  k=1, m=2    76.02  71.54  71.38  73.66  84.78    58.54  28.73  38.54
  k=1, m=3    77.24  72.83  71.83  73.87  85.20    59.76  28.84  38.90
  k=1, m=4    77.24  73.04  71.91  73.90  85.25    59.76  28.84  38.90
  k=1, m=5    76.02  72.11  71.42  73.45  84.98    58.54  28.71  38.53
  k=5, …
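As a quick sanity check on the utility-judgment columns, the sketch below recomputes F1 as the harmonic mean of the precision and recall values listed for the k=1 rows. The numbers are copied from the table text with this caption; whether the paper computes F1 from aggregate P and R or averages per-query F1 is not stated here, so this only shows that the reported triples are mutually consistent to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# m -> (P, R, reported F1), copied from the k=1 rows.
REPORTED = {
    1: (53.66, 24.09, 33.25),
    2: (58.54, 28.73, 38.54),
    3: (59.76, 28.84, 38.90),
    4: (59.76, 28.84, 38.90),
    5: (58.54, 28.71, 38.53),
}

# Recompute F1 from P and R and round to the table's precision.
recomputed = {m: round(f1(p, r), 2) for m, (p, r, _) in REPORTED.items()}
```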
Figure 14. Instruction of the explicit answer generation.
Figure 15. Instruction of the implicit answer generation.
Figure 16. An example of our ITEM-As using Mistral on the TREC dataset.
  Question: when did family feud come out?
  First pseudo answer: Family Feud has been on air since 1976.
  First utility judgment: My selection:[1, 3, 13].
  Second pseudo answer: The original Family Feud debuted in 1976.
  Second utility judgment: My selection:[13].
  Third pseudo answer: The Family Feud debuted in 1976.
  Third utility judgment: My selecti…
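The model outputs in this example follow a "My selection:[...]" pattern. A small parser for that pattern might look like the sketch below; the function name and the choice to return an empty list for truncated or malformed output are assumptions, not something the paper specifies.

```python
import re


def parse_selection(output):
    """Extract passage identifiers from text such as 'My selection:[1, 3, 13].'

    Returns an empty list when no complete bracketed selection is present.
    """
    match = re.search(r"My selection:\s*\[([^\]]*)\]", output)
    if match is None:
        return []
    return [int(token) for token in re.findall(r"\d+", match.group(1))]
```

Applied to the first and second judgments above, this yields `[1, 3, 13]` and `[13]`, which shows the selection narrowing from three passages to one across iterations.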
Figure 17. An example of our ITEM-As using Mistral on the TREC dataset
Original abstract

Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In retrieval-augmented generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components (relevance ranking derived from retrieval models, utility judgments, and answer generation) aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an Iterative utiliTy judgmEnt fraMework (ITEM) that aligns RAG's three components (relevance ranking from retrieval models, utility judgments, and answer generation) with Schutz's three types of relevance representing distinct cognitive levels. It claims this alignment enables an iterative LLM procedure that improves utility judgments, ranking, and answer generation, with experiments on TREC DL, WebAP, GTI-NQ, and NQ datasets showing gains over representative baselines.

Significance. If the central claim holds and the Schutz-derived distinctions are shown to be necessary rather than incidental to iterative prompting, the work could provide an interdisciplinary lens for structuring LLM interactions in IR. The attempt to map cognitive levels explicitly is a conceptual strength, but the abstract supplies no quantitative metrics, baselines, ablations, or statistical tests, limiting assessment of whether the result would advance the field beyond existing multi-step LLM techniques.

major comments (2)
  1. [Abstract] The assertion that 'Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines' supplies no numerical values, baseline identifiers, effect sizes, or statistical tests. This absence is load-bearing for the experimental claim and prevents evaluation of whether the reported gains support the framework.
  2. [Methods/Experimental design] The central claim requires that the three-way alignment with Schutz's relevances produces genuine iterative improvements beyond standard multi-step LLM prompting. No ablation is described that retains iteration while removing the philosophical distinctions (e.g., a control using generic iterative scoring), leaving open whether the mapping is load-bearing or inspirational only.
minor comments (2)
  1. The stylized capitalization in the title and acronym definition (utiliTy judgmEnt fraMework) is unconventional and may reduce readability; consider standard title-case formatting.
  2. No statement on code, data, or prompt availability is provided, which would aid reproducibility of the LLM-based procedure.
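To illustrate the kind of effect-size and significance reporting the first major comment asks for, here is a minimal paired-bootstrap sketch over per-query scores. It is not taken from the paper: the function name, the resampling count, the fixed seed, and the one-sided summary are all assumptions, and the inputs would be per-query metric values (for example NDCG@10 or answer accuracy) for a system and a baseline evaluated on the same queries.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Compare two systems scored on the same queries.

    Returns (mean of a - b, fraction of resamples whose mean difference <= 0).
    A small second value suggests A's gain over B is unlikely to be noise.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each (a, b) pair together.
        sample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

Resampling paired differences rather than the two score lists independently keeps per-query difficulty constant across systems, which matters when a few hard queries dominate the variance.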

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines' supplies no numerical values, baseline identifiers, effect sizes, or statistical tests. This absence is load-bearing for the experimental claim and prevents evaluation of whether the reported gains support the framework.

    Authors: We agree that the abstract would benefit from including specific quantitative results to substantiate the claims. In the revised manuscript, we will update the abstract to report key performance improvements (including effect sizes and baseline identifiers) from the experiments on TREC DL, WebAP, GTI-NQ, and NQ.
    Revision: yes

  2. Referee: [Methods/Experimental design] The central claim requires that the three-way alignment with Schutz's relevances produces genuine iterative improvements beyond standard multi-step LLM prompting. No ablation is described that retains iteration while removing the philosophical distinctions (e.g., a control using generic iterative scoring), leaving open whether the mapping is load-bearing or inspirational only.

    Authors: We acknowledge that an explicit ablation isolating the Schutz-derived distinctions from generic iterative prompting would strengthen the central claim. We will add this ablation study to the revised manuscript, comparing the full ITEM framework against a control that preserves the iterative loop but removes the specific philosophical mappings.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is conceptually motivated without reduction to self-defined inputs

Full rationale

The paper motivates ITEM by noting an alignment between RAG's three components and Schutz's three types of relevance, then proposes the iterative framework on that basis. No equations, fitted parameters, or predictions appear in the provided text, so no step reduces a claimed result to a quantity defined by the authors' own choices or prior self-citations. The mapping functions as inspirational analogy rather than a load-bearing derivation that forces outcomes by construction, and experiments are reported on independent external datasets (TREC DL, WebAP, GTI-NQ, NQ) without statistical forcing from internal fits. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested mapping between Schutz's philosophical relevances and RAG stages plus the assumption that LLMs can reliably perform the utility judgment step when prompted iteratively; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Schutz's three types of relevance correspond to the three core components of RAG and to three cognitive levels for LLMs.
    Explicitly invoked in the abstract as the justification for the iterative framework.


