Diagnosing LLM-based Rerankers in Cold-Start Recommender Systems: Coverage, Exposure and Practical Mitigations

Ekaterina Lemdiasova; Nikita Zmanovskii

arxiv: 2604.16318 · v1 · submitted 2026-02-09 · 💻 cs.IR · cs.CL


Pith reviewed 2026-05-16 05:20 UTC · model grok-4.3

classification 💻 cs.IR cs.CL

keywords: LLM rerankers, cold-start recommendations, retrieval coverage, exposure bias, recommender systems, cross-encoder, popularity baseline

The pith

LLM rerankers fail in cold-start recommendations mainly because initial retrieval misses relevant items, letting simple popularity baselines win by a wide margin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper diagnoses why LLM-based cross-encoder rerankers underperform in cold-start movie recommendation on the Serendipity-2018 dataset. Through controlled tests with 500 users, it isolates three failure modes: candidate retrieval reaches a recall@200 of only 0.109 versus 0.609 for baselines, recommendations collapse onto just 3 unique items versus 497 for a random baseline, and reranker scores barely separate relevant from irrelevant items. Popularity ranking achieves an HR@10 of 0.268 against 0.008 for the LLM reranker, and the authors attribute the gap mainly to the retrieval stage rather than to reranker capacity. They outline concrete fixes, including hybrid candidate generation, candidate pool size tuning, and score calibration.

Core claim

Controlled experiments reveal that LLM rerankers achieve low retrieval coverage (recall@200 = 0.109), extreme exposure bias (concentrating on 3 unique items versus 497 for random), and minimal score discrimination (mean difference 0.098, Cohen's d = 0.13), so that popularity-based ranking delivers substantially higher hit rates (HR@10: 0.268 vs. 0.008) with the performance difference driven by retrieval-stage limitations rather than reranker capacity.
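To make the score-discrimination figure concrete, here is a minimal sketch of Cohen's d using the common pooled-standard-deviation form. The paper's exact variant is not stated in this summary, so the formula choice, the function name `cohens_d`, and the toy scores are all assumptions for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(relevant_scores, irrelevant_scores):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(relevant_scores), len(irrelevant_scores)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(relevant_scores)
        + (n2 - 1) * variance(irrelevant_scores)
    ) / (n1 + n2 - 2)
    return (mean(relevant_scores) - mean(irrelevant_scores)) / math.sqrt(pooled_var)


# Toy reranker scores (hypothetical, not taken from the paper).
relevant = [0.62, 0.55, 0.71, 0.48, 0.66]
irrelevant = [0.58, 0.51, 0.69, 0.47, 0.60]
d = cohens_d(relevant, irrelevant)
print(f"Cohen's d on toy data: {d:.2f}")
```

If the reported values follow this definition, a mean difference of 0.098 with d = 0.13 would imply a pooled standard deviation of roughly 0.098 / 0.13 ≈ 0.75, which means the score distributions for relevant and irrelevant items overlap almost entirely.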

What carries the argument

Diagnostic experiments that measure and separate retrieval coverage, exposure bias, and score discrimination to locate the source of LLM reranker failure in cold-start settings.
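As an illustration of how exposure concentration can be measured, the sketch below counts unique recommended items and the share of recommendation slots taken by the single most frequent item. The function name `exposure_stats` and the toy lists are hypothetical; the paper's own exposure metrics may be defined differently.

```python
from collections import Counter


def exposure_stats(top_k_lists):
    """Summarize how recommendation slots are spread across items.

    top_k_lists: one list of recommended item IDs per user.
    """
    counts = Counter(item for recs in top_k_lists for item in recs)
    total_slots = sum(counts.values())
    # Share of all slots occupied by the single most recommended item.
    top_item_share = counts.most_common(1)[0][1] / total_slots
    return {"unique_items": len(counts), "top_item_share": top_item_share}


# Toy top-1 recommendations for four users (hypothetical data).
collapsed = [["m1"], ["m1"], ["m2"], ["m1"]]
diverse = [["m1"], ["m2"], ["m3"], ["m4"]]

collapsed_stats = exposure_stats(collapsed)
diverse_stats = exposure_stats(diverse)
print(collapsed_stats, diverse_stats)
```

A low unique-item count paired with a high top-item share is the pattern the summary describes as exposure collapse (3 unique items versus 497 for a random baseline).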

If this is right

Retrieval-stage limits explain most of the gap between LLM rerankers and simple baselines.
Hybrid retrieval strategies that combine multiple candidate sources raise coverage and downstream performance.
Increasing candidate pool size and applying score calibration improve discrimination and reduce exposure bias.
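One simple way to combine multiple candidate sources is a round-robin interleave with de-duplication, sketched below. This is an illustrative assumption rather than the authors' method: the function `hybrid_candidates`, the source names, and the pool size are all hypothetical.

```python
from itertools import zip_longest


def hybrid_candidates(sources, pool_size):
    """Round-robin interleave several ranked candidate lists, dropping duplicates.

    sources: ranked lists of item IDs, best first. Item IDs are assumed never
    to be None, because None marks the padding of exhausted lists.
    """
    seen, merged = set(), []
    for tier in zip_longest(*sources):
        for item in tier:
            if item is None or item in seen:
                continue
            seen.add(item)
            merged.append(item)
            if len(merged) == pool_size:
                return merged
    return merged


# Hypothetical ranked lists from a dense retriever and a popularity ranker.
dense = ["a", "b", "c", "d"]
popular = ["x", "b", "y"]
pool = hybrid_candidates([dense, popular], pool_size=5)
print(pool)  # ['a', 'x', 'b', 'c', 'y']
```

Interleaving guarantees that each source contributes its top items before the pool fills, so a weak retriever cannot crowd out a source with better coverage. Weighted or score-fused merging are alternatives when the sources produce comparable scores.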

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

LLM rerankers may deliver stronger results when tightly coupled to improved retrieval models instead of applied after weak candidate sets.
Exposure bias toward a tiny item subset could recur in any domain with large catalogs unless candidate generation is diversified.
The recommended mitigations should be tested on non-movie datasets to check whether retrieval remains the dominant bottleneck.

Load-bearing premise

The three failure modes and the outperformance of popularity ranking hold beyond the Serendipity-2018 dataset and the specific cross-encoder rerankers tested.

What would settle it

Replace the candidate generator with one that achieves recall@200 above 0.5 on the same users and dataset, then check whether the LLM reranker exceeds the popularity baseline in HR@10.
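A cheap check to run before that experiment is the oracle ceiling of a candidate set: the fraction of users whose candidates contain at least one relevant item. No reranker restricted to those candidates can exceed this hit rate at any cutoff. The sketch below assumes per-user dictionaries of candidates and held-out relevant items; all names and data are hypothetical.

```python
def oracle_hit_rate(candidates_by_user, relevant_by_user):
    """Upper bound on HR@K for any reranker confined to the given candidates."""
    users = list(relevant_by_user)
    # A user can be a hit only if some relevant item survived retrieval.
    reachable = sum(
        1
        for user in users
        if set(candidates_by_user.get(user, [])) & relevant_by_user[user]
    )
    return reachable / len(users)


# Toy data: only user u1 has a relevant item among the candidates.
candidates = {"u1": ["a", "b"], "u2": ["c", "d"], "u3": ["e"]}
relevant = {"u1": {"b"}, "u2": {"z"}, "u3": {"q", "r"}, "u4": {"a"}}
ceiling = oracle_hit_rate(candidates, relevant)
print(f"Oracle HR ceiling: {ceiling:.2f}")  # 0.25
```

Comparing this ceiling for the original and the replacement candidate generator shows how much headroom the reranker has in each condition, independent of its scoring quality.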

Figures

Figures reproduced from arXiv: 2604.16318 by Ekaterina Lemdiasova, Nikita Zmanovskii.

**Figure 2.** Coverage Analysis: Recall@K curves for all methods. FAISS…

**Figure 4.** Top-1 Exposure: Cross-encoder reranker concentrates recommenda…

**Figure 5.** Item Exposure Distribution: Left shows histogram of exposure counts.

**Figure 7.** Score Distribution Analysis (seed=42): Left shows overlapping his…

Original abstract

Large language models (LLMs) and cross-encoder rerankers have gained attention for improving recommender systems, particularly in cold-start scenarios where user interaction history is limited. However, practical deployment reveals significant performance gaps between LLM-based approaches and simple baselines. This paper presents a systematic diagnostic study of cross-encoder rerankers in cold-start movie recommendation using the Serendipity-2018 dataset. Through controlled experiments with 500 users across multiple random seeds, we identify three critical failure modes: (1) low retrieval coverage in candidate generation (recall@200 = 0.109 vs. 0.609 for baselines), (2) severe exposure bias with rerankers concentrating recommendations on 3 unique items versus 497 for random baseline, and (3) minimal score discrimination between relevant and irrelevant items (mean difference = 0.098, Cohen's d = 0.13). We demonstrate that popularity-based ranking substantially outperforms LLM reranking (HR@10: 0.268 vs. 0.008, p < 0.001), with the performance gap primarily attributable to retrieval stage limitations rather than reranker capacity. Based on these findings, we provide actionable recommendations including hybrid retrieval strategies, candidate pool size optimization, and score calibration techniques. All code, configurations, and experimental results are made available for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM rerankers underperform popularity baselines in cold-start on Serendipity-2018 mainly because of weak candidate coverage and exposure collapse.

The main finding is that on Serendipity-2018 with 500 users, LLM cross-encoder rerankers deliver far worse results than a simple popularity baseline in cold-start movie recommendation. HR@10 falls to 0.008 from 0.268, and the authors tie most of the gap to retrieval-stage problems rather than the reranker itself. They report recall@200 at 0.109, rerankers collapsing exposure to just three unique items, and minimal score discrimination with Cohen's d of 0.13. Multiple seeds and p-values back the comparisons. The code release lets anyone check the numbers directly.

The experiments are clean for what they test and the internal logic holds: low coverage in the candidate pool explains why reranking cannot recover. The work is useful because it moves past vague claims about LLM promise and shows concrete failure modes with actionable suggestions like hybrid retrieval and calibration.

The soft spot is scope. Everything rests on one dataset and the tested cross-encoders. No trials use stronger candidate generators or other cold-start corpora, so it remains open whether the same rerankers would show better coverage or discrimination under different retrieval conditions. The causal attribution to retrieval limits is plausible here but not yet stress-tested elsewhere.

This is worth peer review for recsys researchers who need to know where current LLM rerankers actually break in practice. The empirical diagnosis is sharp enough to discuss even if broader validation would help.

Referee Report

1 major / 3 minor

Summary. The paper claims that LLM-based cross-encoder rerankers suffer from three key failure modes in cold-start movie recommendation on the Serendipity-2018 dataset: insufficient coverage in the candidate generation stage (recall@200 = 0.109), extreme exposure bias where recommendations focus on only 3 unique items, and weak discrimination in scoring relevant vs. irrelevant items (mean difference 0.098, Cohen's d = 0.13). Through experiments with 500 users and multiple seeds, it shows popularity-based ranking achieves much higher HR@10 (0.268 vs. 0.008 for LLM reranking, p<0.001), attributing the gap mainly to retrieval limitations. The work offers mitigations like hybrid retrieval and releases all code for reproducibility.

Significance. This diagnostic study is significant for highlighting real-world limitations of LLM rerankers in recommender systems, particularly cold-start cases where interaction data is scarce. By quantifying the gaps with statistical rigor (multiple seeds, p-values) and contrasting against simple baselines like popularity ranking, it provides evidence that retrieval quality is a critical bottleneck. The practical recommendations and open-sourcing of experiments add substantial value for practitioners and researchers exploring LLM integration in IR systems.

major comments (1)

§4.3 (Attribution of Performance Gap): The assertion that the performance gap is primarily due to retrieval-stage limitations rather than reranker capacity is based on the low recall@200 and small Cohen's d=0.13. However, this would be more convincing with an additional experiment reranking a higher-quality candidate set (e.g., from a collaborative filtering retriever with recall@200 > 0.5). Without it, the causal claim remains partially untested within the reported setup.

minor comments (3)

Abstract: The abstract provides excellent concrete numbers, but the full paper should define HR@10, recall@K, and Cohen's d at their first mention in the main body for readers unfamiliar with the metrics.
Figure 1 (Exposure Bias): The visualization of exposure bias would be clearer if it included error bars or the distribution for the random baseline to allow direct comparison with the reported 497 unique items.
Related Work: Consider adding references to recent works on LLM reranking in recsys (e.g., papers from RecSys 2023-2024) to better contextualize the contribution.
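For readers who want the definitions requested in the first minor comment, the sketch below gives per-user HR@K and recall@K as they are commonly defined; the reported figures would then be averages over users. The paper's exact definitions are not reproduced here, so treat these functions and the toy data as assumptions.

```python
def hit_rate_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the user's relevant items found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Toy ranking and held-out relevant items for one user (hypothetical).
ranked = ["a", "b", "c", "d"]
relevant = {"c", "z"}

print(hit_rate_at_k(ranked, relevant, 2))  # 0.0
print(hit_rate_at_k(ranked, relevant, 3))  # 1.0
print(recall_at_k(ranked, relevant, 3))    # 0.5
```

Under these definitions, recall@200 describes the candidate generator's coverage, while HR@10 describes the final list shown to the user.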

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our diagnostic study. The suggestion to strengthen the causal attribution in §4.3 is well-taken, and we address it directly below with a commitment to revision.

Referee: §4.3 (Attribution of Performance Gap): The assertion that the performance gap is primarily due to retrieval-stage limitations rather than reranker capacity is based on the low recall@200 and small Cohen's d=0.13. However, this would be more convincing with an additional experiment reranking a higher-quality candidate set (e.g., from a collaborative filtering retriever with recall@200 > 0.5). Without it, the causal claim remains partially untested within the reported setup.

Authors: We agree that directly testing the reranker on a higher-recall candidate set would provide stronger causal evidence. In the revised manuscript we will add an experiment that reranks candidates from a collaborative filtering retriever (e.g., matrix factorization or ALS) achieving recall@200 > 0.5 on Serendipity-2018. We will report the resulting HR@10, exposure statistics, and score discrimination metrics for the LLM reranker under this improved retrieval condition, allowing readers to isolate the reranker's contribution more clearly. This addition directly addresses the concern while preserving the original finding that retrieval coverage remains the dominant bottleneck.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; all claims are direct empirical measurements

The paper reports controlled experiments on the Serendipity-2018 dataset with held-out test interactions. All key quantities (recall@200 = 0.109, HR@10 0.268 vs 0.008, Cohen's d = 0.13, exposure to 3 items) are computed directly from observed data and random seeds. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing derivations. The central comparison of popularity vs LLM reranking is a straightforward empirical contrast, not a constructed result. Generalization limits exist but do not constitute circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard statistical testing and the representativeness of the Serendipity-2018 cold-start split; no new free parameters or invented entities are introduced.

axioms (1)

standard math: Standard two-sample statistical tests with p < 0.001 correctly identify meaningful differences in HR@10 and recall.
Invoked when reporting p-values for popularity vs. LLM reranker comparisons.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We identify three critical failure modes: (1) low retrieval coverage... (2) severe exposure bias... (3) minimal score discrimination... HR@10: 0.268 vs. 0.008"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Recall@200 = 0.109 vs. 0.609... Cohen's d = 0.13"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] Y. Wang et al., “Large Language Models for Recommender Systems: A Survey,” arXiv preprint arXiv:2305.19860, 2023.
[2] L. Chen et al., “ColdRAG: Retrieval-Augmented Generation for Cold-Start Recommendation,” arXiv preprint arXiv:2410.12345, 2024.
[3] A. M. Rashid et al., “Getting to Know You: Learning New User Preferences in Recommender Systems,” in Proc. ACM IUI, 2002, pp. 127-134.
[4] Z. Liu et al., “Neural Reranking for Information Retrieval: A Survey,” ACM Computing Surveys, vol. 55, no. 6, pp. 1-35, 2022.
[5] T. Bajaj et al., “MS MARCO: A Human Generated Machine Reading Comprehension Dataset,” NeurIPS Datasets Track, 2016.
[6] K. Zhang et al., “Language-Model Prior Overcomes Cold-Start Items,” arXiv preprint arXiv:2411.09065, 2024.
[7] M. Johnson et al., “Adaptive Candidate Retrieval with Dynamic Knowledge Graph Construction for Cold-Start Recommendation,” arXiv preprint arXiv:2505.20773, 2025.
[8] H. Steck, “Calibrated Recommendations,” in Proc. ACM RecSys, 2018, pp. 154-162.
[9] G. Shani and A. Gunawardana, “Evaluating Recommendation Systems,” in Recommender Systems Handbook, Springer, 2011, pp. 257-297.
[10] A. Singh and T. Joachims, “Fairness of Exposure in Rankings,” in Proc. ACM SIGKDD, 2018, pp. 2219-2228.
[11] Y. Liu et al., “Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems,” arXiv preprint arXiv:2507.17290, 2025.
[12] S. Kumar et al., “Calibration of Large Language Models: A Survey,” arXiv preprint arXiv:2308.10144, 2023.
[13] P. Liu et al., “Out-of-Distribution Robustness of Large Language Models,” in Proc. NeurIPS, 2023, pp. 15420-15433.
[14] J. Chen et al., “Bias and Debias in Recommender Systems: A Survey and Future Directions,” ACM Trans. Inf. Syst., vol. 41, no. 3, pp. 1-39, 2023.
[15] R. Burke, “Hybrid Recommender Systems: Survey and Experiments,” User Modeling and User-Adapted Interaction, vol. 12, no. 4, pp. 331-370, 2002.
[16] J. Johnson et al., “Billion-scale Similarity Search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535-547, 2021.
[17] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Proc. EMNLP-IJCNLP, 2019, pp. 3982-3992.
[18] V. Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering,” in Proc. EMNLP, 2020, pp. 6769-6781.
[19] C. W. Kofod-Petersen and M. M. Aamodt, “Evaluation of Recommender Systems: A Framework,” in Proc. ICCBR Workshops, 2009.
[20] B. Lika et al., “Facing the Cold Start Problem in Recommender Systems,” Expert Systems with Applications, vol. 41, no. 4, pp. 2065-2073, 2014.
