FRESCO: Benchmarking and Optimizing Re-rankers for Evolving Semantic Conflict in Retrieval-Augmented Generation

Sohyun An (1, 2), Hayeon Lee (1), Shuibenyang Yuan (1), Chun-cheng Jason Chen (1), Cho-Jui Hsieh (2), Vijai Mohan (1), Alexander Min (1) ((1) Meta Superintelligence Labs; (2) UCLA)

arxiv: 2604.14227 · v1 · submitted 2026-04-14 · 💻 cs.IR · cs.AI



Pith reviewed 2026-05-10 14:17 UTC · model grok-4.3


keywords retrieval-augmented generation · re-rankers · evolving knowledge · temporal semantic conflict · benchmark · instruction optimization · factual recency · semantic relevance


The pith

Re-rankers in RAG pipelines consistently prefer older semantically rich documents over newer factual ones, even when the older versions are obsolete.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation uses re-rankers to select the best documents from initial retrieval results, yet existing evaluations focus on static, unchanging information. This paper creates FRESCO to test re-rankers on evolving knowledge by pairing queries that seek current facts with old and new versions of Wikipedia articles. The evaluation shows a clear pattern: re-rankers choose the older, more detailed documents even though they are factually out of date. The authors then apply an instruction optimization process that searches for instructions balancing evolving and non-evolving tasks, producing gains of up to 27 percent on evolving knowledge while preserving results on stable tasks. This matters because many real queries involve facts that change over time, and current re-rankers risk delivering outdated answers.

Core claim

The paper introduces the FRESCO benchmark, which pairs recency-seeking queries with historical Wikipedia revisions to measure whether re-rankers can prioritize factually recent evidence while preserving semantic relevance. Evaluation across existing re-rankers identifies a consistent bias toward older documents. An instruction optimization framework then identifies Pareto-optimal instructions that deliver gains of up to 27% on Evolving Knowledge tasks while maintaining competitive performance on Non-Evolving Knowledge tasks.
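To make the measurement concrete, the following is a minimal sketch of a pairwise check: for each query, does a scorer rank the newer revision above the older one? It is not the paper's metric or data. `RevisionPair`, `newer_preferred_rate`, and `overlap_score` are hypothetical names, the example texts are invented, and the token-overlap scorer is only a toy stand-in for a real re-ranker.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RevisionPair:
    """One hypothetical benchmark item: a recency-seeking query plus two revisions."""
    query: str
    old_text: str  # obsolete revision
    new_text: str  # current revision, the one that should rank first


def newer_preferred_rate(pairs: List[RevisionPair],
                         score: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the scorer ranks the newer revision strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(score(p.query, p.new_text) > score(p.query, p.old_text) for p in pairs)
    return wins / len(pairs)


def overlap_score(query: str, doc: str) -> float:
    """Toy scorer (assumption): counts document tokens that also occur in the query."""
    query_tokens = set(query.lower().split())
    return float(sum(tok in query_tokens for tok in doc.lower().split()))


# Invented example: the old revision repeats query terms, the new one is terse.
pairs = [
    RevisionPair(
        query="who is the current ceo of acme",
        old_text="Alice Smith is the ceo of acme . The ceo of acme leads acme strategy .",
        new_text="Bob Jones became ceo of acme in 2025 .",
    ),
]
rate = newer_preferred_rate(pairs, overlap_score)
```

In this toy case the longer, obsolete revision wins on lexical overlap alone, which is the kind of behaviour the benchmark is described as probing.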

What carries the argument

The FRESCO benchmark, which evaluates re-rankers by pairing recency-seeking queries with temporal Wikipedia revision pairs, together with the instruction optimization framework that searches for instructions balancing evolving and non-evolving performance.

If this is right

Re-rankers must incorporate explicit checks for factual recency in addition to semantic similarity.
Instruction optimization can reduce preference for obsolete information without sacrificing performance on stable queries.
RAG systems using optimized re-rankers will more reliably surface up-to-date evidence when facts evolve.
Static benchmarks miss critical failure modes that appear only under temporal change.
Pareto-optimal instruction search offers a practical way to tune re-rankers for mixed evolving and non-evolving workloads.
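The idea of a Pareto-optimal instruction can be illustrated with a short sketch. It assumes each candidate instruction has already been scored on both task families, with higher being better. The `Candidate` type, the instruction labels, and the scores are all invented for illustration; the paper's actual search procedure is not described in this summary.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    instruction: str
    evolving: float      # score on Evolving Knowledge tasks (higher is better)
    non_evolving: float  # score on Non-Evolving Knowledge tasks (higher is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.evolving >= b.evolving and a.non_evolving >= b.non_evolving
    strictly_better = a.evolving > b.evolving or a.non_evolving > b.non_evolving
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(other, c) for other in cands)]


# Invented scores, for illustration only.
candidates = [
    Candidate("baseline", 0.40, 0.70),
    Candidate("prefer recent evidence", 0.65, 0.55),
    Candidate("balanced", 0.55, 0.68),
    Candidate("weak", 0.35, 0.60),
]
front = pareto_front(candidates)
```

Here "weak" is dropped because "baseline" beats it on both objectives, while the other three each represent a different trade-off and stay on the front.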

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bias toward older documents may appear in the initial retrieval stage before re-ranking occurs.
Similar optimization could be applied to other RAG components such as query rewriting or answer generation prompts.
Expanding the benchmark beyond Wikipedia to other evolving sources would test whether the observed failure mode is domain-specific.
Real-user query logs could serve as an external check on whether the synthetic recency queries capture typical temporal conflicts.

Load-bearing premise

The selected recency-seeking queries and Wikipedia revision pairs form an unbiased, representative sample of real-world cases where semantic content and factual correctness conflict over time.

What would settle it

Apply the same re-rankers and optimization process to a fresh collection of recency-seeking queries drawn from news archives or scientific literature instead of Wikipedia and check whether the bias toward older documents and the 27% gain both remain.

Original abstract

Retrieval-Augmented Generation (RAG) is a key approach to mitigating the temporal staleness of large language models (LLMs) by grounding responses in up-to-date evidence. Within the RAG pipeline, re-rankers play a pivotal role in selecting the most useful documents from retrieved candidates. However, existing benchmarks predominantly evaluate re-rankers in static settings and do not adequately assess performance under evolving information -- a critical gap, as real-world systems often must choose among temporally different pieces of evidence. To address this limitation, we introduce FRESCO (Factual Recency and Evolving Semantic COnflict), a benchmark for evaluating re-rankers in temporally dynamic contexts. By pairing recency-seeking queries with historical Wikipedia revisions, FRESCO tests whether re-rankers can prioritize factually recent evidence while maintaining semantic relevance. Our evaluation reveals a consistent failure mode across existing re-rankers: a strong bias toward older, semantically rich documents, even when they are factually obsolete. We further investigate an instruction optimization framework to mitigate this issue. By identifying Pareto-optimal instructions that balance Evolving and Non-Evolving Knowledge tasks, we obtain gains of up to 27% on Evolving Knowledge tasks while maintaining competitive performance on Non-Evolving Knowledge tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FRESCO's temporal benchmark for RAG re-rankers is a useful direction but its bias claims look vulnerable to how the Wikipedia revision pairs were selected.


The main thing to know is that this paper builds a benchmark called FRESCO to test re-rankers on queries where facts have changed over time, drawn from Wikipedia edits, and shows that existing ones prefer older versions. They also try optimizing instructions to fix it and claim up to 27% better on the evolving part.

What stands out as new is the focus on temporal conflict with actual revision pairs rather than synthetic or static data. The Pareto instruction search for balancing tasks is a reasonable engineering step that could be useful in practice. The paper does well in identifying a plausible real-world issue for RAG systems that need to handle updates. Using real history adds some credibility over made-up examples.

The soft spots are around the evaluation setup. The concern about older documents being more semantically rich is worth checking because if the revisions were picked such that newer ones are shorter or less detailed, then semantic re-rankers would naturally score lower on them without any temporal bias. The abstract lacks specifics on how queries were made or any checks for that, and there's no word on whether they controlled for document length or embedding density. Without those, the consistent failure mode is hard to trust fully. Also, the gains need more context on baselines and variance.

This paper is aimed at people working on retrieval components for LLMs, especially in production RAG where knowledge changes. A practitioner might pick up the optimization trick, while a researcher could extend the benchmark. It deserves peer review because the topic matters and the benchmark could be a starting point, though it will need stronger methodology sections to be convincing. I'd send it along with notes on the construction details.

Referee Report

2 major / 2 minor

Summary. The paper introduces FRESCO, a benchmark that pairs recency-seeking queries with historical Wikipedia revisions to evaluate re-rankers in RAG pipelines under evolving semantic conflict. It claims that existing re-rankers exhibit a consistent bias toward older, semantically rich but factually obsolete documents, and that an instruction optimization framework can deliver up to 27% gains on evolving-knowledge tasks while preserving competitive performance on non-evolving tasks.

Significance. If the benchmark construction is shown to be free of selection bias and the gains are reproducible with proper controls, the work would be significant: it identifies a practically important failure mode in current re-rankers for temporally dynamic retrieval and supplies a concrete optimization method to mitigate it. The reported 27% improvement on evolving tasks is large enough to matter for real RAG deployments if the evaluation is sound.

major comments (2)

[FRESCO benchmark construction] Benchmark construction (FRESCO description, likely §3): the central claim of a 'consistent failure mode' (bias toward older, semantically rich documents) and the 27% gains both rest on the assumption that recency labels are assigned independently of document features that re-rankers already optimize for (length, entity density, embedding norms). The manuscript must demonstrate that Wikipedia revision selection and query construction do not systematically correlate older documents with higher semantic richness; otherwise the observed bias is confounded by test-set construction rather than revealing a true re-ranker defect.
[Experiments and results] Experimental evaluation (§4–5): the abstract and evaluation summary provide no details on statistical significance testing, controls for document length or semantic similarity, or the exact definition of the Evolving vs. Non-Evolving Knowledge task splits. Without these, the magnitude of the reported gains cannot be verified and the cross-task balance claim is difficult to assess.
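The confound raised in the first major comment can be checked with a simple correlation between the recency label and a surface feature such as length. The sketch below is an assumption-laden illustration: the function names and the four toy documents are invented, and word count stands in for the richer features the referee lists (entity density, embedding norms).

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; with a 0/1 label in x this is the point-biserial correlation."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def length_recency_correlation(docs: List[Tuple[str, bool]]) -> float:
    """Correlate the 'is the older revision' label with whitespace-token length."""
    labels = [1.0 if is_older else 0.0 for _, is_older in docs]
    lengths = [float(len(text.split())) for text, _ in docs]
    return pearson(labels, lengths)


# Invented toy documents: (text, is_older). Older revisions are longer here.
docs = [
    ("a b c d e f", True),
    ("a b", False),
    ("a b c d e", True),
    ("a b c", False),
]
r = length_recency_correlation(docs)
```

A value of `r` near zero on the real revision pairs would support the claim that length does not explain the preference for older documents; a strongly positive value, as in this toy data, would point to the confound.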

minor comments (2)

[§3] Clarify the precise criteria used to label queries as 'recency-seeking' and to select the historical revisions; a short table or pseudocode would help.
[Instruction optimization] The instruction optimization framework is only sketched; adding the search space size, number of candidate instructions, and how the Pareto front is computed would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.


Referee: [FRESCO benchmark construction] Benchmark construction (FRESCO description, likely §3): the central claim of a 'consistent failure mode' (bias toward older, semantically rich documents) and the 27% gains both rest on the assumption that recency labels are assigned independently of document features that re-rankers already optimize for (length, entity density, embedding norms). The manuscript must demonstrate that Wikipedia revision selection and query construction do not systematically correlate older documents with higher semantic richness; otherwise the observed bias is confounded by test-set construction rather than revealing a true re-ranker defect.

Authors: We appreciate this important concern about potential confounding. In the revised manuscript we will add a dedicated analysis subsection under §3 that quantifies document-level features (length, entity density, embedding norm statistics) across the selected Wikipedia revision pairs. We will report mean differences, correlation coefficients with recency labels, and statistical tests showing that older revisions are not systematically richer on these dimensions. This will directly demonstrate that the recency labels are assigned independently of the features re-rankers optimize for, confirming that the observed bias reflects a genuine re-ranker failure mode rather than test-set construction artifacts.

revision: yes

Referee: [Experiments and results] Experimental evaluation (§4–5): the abstract and evaluation summary provide no details on statistical significance testing, controls for document length or semantic similarity, or the exact definition of the Evolving vs. Non-Evolving Knowledge task splits. Without these, the magnitude of the reported gains cannot be verified and the cross-task balance claim is difficult to assess.

Authors: We agree that these details are essential for reproducibility and proper interpretation. In the revised version we will: (i) add statistical significance testing (paired t-tests and Wilcoxon signed-rank tests with p-values and effect sizes) for all reported gains including the 27% figure; (ii) include explicit controls such as length-matched subsets and regression analysis on semantic similarity scores; and (iii) provide a precise definition of the Evolving vs. Non-Evolving splits, including the query-generation criteria, document categorization rules, and dataset statistics. These additions will appear in §4 and §5 with updated tables and text.

revision: yes
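As a rough illustration of the paired testing promised in point (i), the sketch below computes a paired t-statistic and, as a plainly swapped-in substitute for the Wilcoxon signed-rank test, an exact sign-flip permutation p-value. It uses only the standard library. The per-query scores are invented, and the function names are hypothetical.

```python
import itertools
import math
import statistics
from typing import List, Sequence


def paired_differences(before: Sequence[float], after: Sequence[float]) -> List[float]:
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    return [a - b for a, b in zip(after, before)]


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) over per-item differences d."""
    d = paired_differences(before, after)
    sd = statistics.stdev(d)
    if sd == 0:
        raise ValueError("t-statistic is undefined when all differences are equal")
    return statistics.mean(d) / (sd / math.sqrt(len(d)))


def sign_flip_p_value(before: Sequence[float], after: Sequence[float]) -> float:
    """Exact two-sided paired permutation test; enumerates all 2**n sign patterns."""
    d = paired_differences(before, after)
    observed = abs(sum(d))
    total = 0
    extreme = 0
    for signs in itertools.product((1, -1), repeat=len(d)):
        total += 1
        if abs(sum(s * x for s, x in zip(signs, d))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Invented per-query scores under a default and an optimized instruction.
default_scores = [0.40, 0.35, 0.50, 0.45, 0.30, 0.42]
optimized_scores = [0.52, 0.41, 0.58, 0.55, 0.39, 0.47]
t_stat = paired_t_statistic(default_scores, optimized_scores)
p_value = sign_flip_p_value(default_scores, optimized_scores)
```

The exhaustive enumeration is only practical for small samples; on a full benchmark a sampled permutation test or a library implementation of the tests named in the rebuttal would be the more realistic choice.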

Circularity Check

0 steps flagged

No significant circularity; benchmark and gains derived from external data and empirical optimization


The paper introduces the FRESCO benchmark by pairing recency-seeking queries with historical Wikipedia revisions and evaluates re-rankers on it, then reports empirical gains from an instruction optimization framework that identifies Pareto-optimal instructions balancing task types. These steps rely on external data sources and standard IR evaluation practices rather than any self-definitional reduction, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or derivations in the provided text reduce by construction to the paper's own inputs; the central claims about failure modes and gains are supported by direct evaluation on the introduced benchmark without circular loops. This is the most common honest finding for benchmark papers that remain self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The work implicitly assumes standard IR notions of relevance and recency are well-defined and that Wikipedia revision history is a faithful proxy for factual evolution.

pith-pipeline@v0.9.0 · 5575 in / 1129 out tokens · 21159 ms · 2026-05-10T14:17:43.490171+00:00 · methodology


