LLM-as-a-Judge for Reliable and Explainable Offline Evaluation in Top-K Recommendation

Chen Ma; Haiming Jin; Junyi Zhou; Qiao Xiang; Xiaokun Zhang; Yue Que

arxiv: 2606.22961 · v1 · pith:GUTIB7LEnew · submitted 2026-06-22 · 💻 cs.IR

LLM-as-a-Judge for Reliable and Explainable Offline Evaluation in Top-K Recommendation

Yue Que , Junyi Zhou , Xiaokun Zhang , Haiming Jin , Qiao Xiang , Chen Ma This is my paper

Pith reviewed 2026-06-26 06:57 UTC · model grok-4.3

classification 💻 cs.IR

keywords LLM judgeoffline evaluationtop-K recommendationsemantic proxyexplainable evaluationrecommender systemsreliability

0 comments

The pith

LLM-as-a-Judge uses semantic proxies from user text to deliver reliable and explainable top-K recommendation evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional offline top-K evaluation relies on holdout feedback as a proxy for preferences but suffers from exposure bias and rigid ID matching that distorts results, while also providing only numerical scores without insight. This paper proposes an LLM-as-a-Judge framework that extracts a semantic proxy from user textual behaviors to represent preferences, then applies a reasoning-then-scoring process for flexible semantic matching and explicit rationales. The individual judgments are aggregated into standard Top-K metrics with justifications for hits and misses. A sympathetic reader would care because this could produce evaluations less distorted by how items were shown to users and more transparent about why a recommendation succeeds or fails.

Core claim

The LLM Judge framework replaces rigid ID matching on biased holdout data with semantic matching on a textual preference proxy, using an LLM to generate reasoned relevance judgments that aggregate into reliable Top-K metrics while supplying explicit justifications for each assessment.

What carries the argument

The LLM Judge that executes a reasoning-then-scoring process on semantic proxies derived from user textual behaviors to produce relevance judgments and rationales.

If this is right

Recommendation quality can be measured via flexible semantic matching rather than exact item ID matches on holdout feedback.
Each preference hit or miss receives an explicit rationale from the LLM to support the numerical scores.
Global Top-K metrics are computed by aggregating the individual reasoned judgments.
The evaluation process maintains robustness when tested across varied recommendation models and datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support evaluation in settings where holdout interaction data is sparse or unavailable.
It opens the possibility of assessing qualitative aspects such as explanation quality alongside relevance.
Existing offline benchmarks may need re-examination if semantic proxies consistently diverge from ID-based results.

Load-bearing premise

The semantic proxy extracted from user textual behaviors accurately captures true preferences without the exposure bias that affects holdout interaction data.

What would settle it

A controlled study on a dataset with unbiased preference labels where traditional ID-based metrics and the LLM Judge produce materially different top-K rankings, and the LLM rankings show lower correlation with the unbiased labels.

Figures

Figures reproduced from arXiv: 2606.22961 by Chen Ma, Haiming Jin, Junyi Zhou, Qiao Xiang, Xiaokun Zhang, Yue Que.

**Figure 2.** Figure 2: Evaluation Prompt applied in our LLM Judge. Basic [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Scatter plot between proxy evaluation and unbiased evaluation. Diagonal line [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Clustering results of the judgment rationales after [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Evaluation performance on Recall@5 and NDCG@5 of the Coat dataset. Dashed baseline represents the benchmark [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Recommendation evaluation plays a crucial role in guiding the refinement and deployment of recommender systems. Most existing trials rely on offline evaluation using Top-K metrics computed over holdout user behaviors. However, we identify two fundamental limitations that undermine their ability to deliver reliable and explainable evaluations. Regarding reliability, offline evaluation treats observed user feedback as a proxy of true preferences and enforces rigid ID matching between the proxy and recommendation. In practice, feedback collections are inherently shaped by incomplete and biased item exposure, leading to distorted and unreliable assessments. Regarding explainability, Top-K metrics only establish numerical scores without offering meaningful insights to support them, thereby reinforcing the black-box nature of offline evaluation. In this paper, we propose a reliable and explainable LLM-as-a-Judge framework for offline recommendation evaluation. To enhance reliability, we introduce a semantic proxy from user textual behaviors to represent their true preferences. This proxy allows for more flexible matching between preferences and recommendations in the semantic space, rather than depending on the holdout feedback. To ensure explainability, the LLM Judge adopts a reasoning-then-scoring process to generate relevance judgments along with explicit rationale. Finally, we aggregate the individual scores into global Top-K metrics to quantify overall recommendation quality, and provide justification for each preference hit or miss. Extensive experiments demonstrate that the LLM Judge achieves solid reliability, explainability, and robustness in evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper replaces rigid ID matching with an LLM judge on a semantic proxy from user text to cut exposure bias in top-K recsys eval, but the claims stand or fall on whether that proxy and the judgments actually deliver less distortion.

read the letter

The central idea is to build a semantic proxy from user textual behaviors, feed it to an LLM that reasons then scores relevance, and aggregate those scores into top-K metrics while keeping the rationales for explainability. This directly targets the exposure bias baked into holdout feedback and the opacity of standard numerical metrics.

The approach is new in its specific combination for recsys offline evaluation. It keeps the output format familiar enough for comparison with existing work while adding per-judgment justifications. The abstract lays out the motivation cleanly and states that experiments show reliability, explainability, and robustness.

The soft spot is the untested assumption that the textual proxy plus LLM judgments are meaningfully less biased than observed feedback. Textual signals can still reflect exposure patterns or other selection effects, and LLMs introduce their own prompt sensitivity and model-specific quirks. Without the full experimental details on proxy construction, prompt design, and controls, it is hard to judge how well those issues were handled.

This is for researchers in recommender systems and IR who care about evaluation methodology and want metrics that better support deployment decisions. It engages the literature on the stated problems without obvious internal contradictions.

I would send it to peer review so the experiments can be checked against the claims.

Referee Report

2 major / 1 minor

Summary. The paper identifies two limitations in traditional offline top-K recommendation evaluation: unreliability due to exposure bias in holdout feedback used as proxy for true preferences, and lack of explainability in numerical metrics. It proposes an LLM-as-a-Judge framework that introduces a semantic proxy from user textual behaviors for flexible semantic space matching, uses a reasoning-then-scoring process to generate relevance judgments with rationales, and aggregates these into global Top-K metrics with justifications for hits and misses. The paper claims that extensive experiments show the framework achieves solid reliability, explainability, and robustness.

Significance. If the results hold, this framework could offer a valuable alternative to standard offline evaluation methods in recommender systems by addressing exposure bias and providing explainable outputs. This is significant for the field as it attempts to make evaluations more aligned with true user preferences. The approach is novel in applying LLM reasoning to recsys evaluation.

major comments (2)

[Abstract and methods description] The central claim rests on the semantic proxy from textual behaviors being less distorted by exposure bias than holdout ID matching; the manuscript provides no validation experiment or direct comparison demonstrating this (e.g., in the methods or results sections), which is load-bearing for the reliability argument.
[Framework description] The aggregation of individual LLM judgments into global Top-K metrics is described at a high level but lacks detail on how per-item rationales are combined without introducing new selection bias; this step is central to claiming equivalence or superiority to standard metrics.

minor comments (1)

[Abstract] The abstract asserts 'extensive experiments' but provides no quantitative results, datasets, or baselines; adding a sentence summarizing key metrics would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and methods description] The central claim rests on the semantic proxy from textual behaviors being less distorted by exposure bias than holdout ID matching; the manuscript provides no validation experiment or direct comparison demonstrating this (e.g., in the methods or results sections), which is load-bearing for the reliability argument.

Authors: We agree this is a load-bearing claim and that the current experiments demonstrate overall reliability without a direct head-to-head validation of exposure-bias reduction in the semantic proxy versus ID matching. In the revised version we will add a targeted experiment that simulates controlled exposure bias and measures alignment of each proxy with held-out true preferences. revision: yes
Referee: [Framework description] The aggregation of individual LLM judgments into global Top-K metrics is described at a high level but lacks detail on how per-item rationales are combined without introducing new selection bias; this step is central to claiming equivalence or superiority to standard metrics.

Authors: We acknowledge that the aggregation procedure is presented at a high level and that explicit safeguards against new selection bias are not detailed. The revised manuscript will expand Section 3.3 with the precise aggregation algorithm, including how rationales are weighted, how ties or low-confidence judgments are handled, and the bias-mitigation steps employed. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes an LLM-as-a-Judge framework that introduces a semantic proxy derived from user textual behaviors and a reasoning-then-scoring process for relevance judgments. No equations, fitted parameters, or derivation steps are present in the provided text that reduce a claimed prediction or result to its own inputs by construction. The central claims rest on the introduction of new components and experimental validation rather than self-referential definitions or load-bearing self-citations. The argument is self-contained as a methodological proposal without internal reduction to prior fitted values or ansatzes from the authors' own prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be extracted. The central claim rests on unstated assumptions about LLM judgment quality and the fidelity of textual proxies.

pith-pipeline@v0.9.1-grok · 5786 in / 1063 out tokens · 29319 ms · 2026-06-26T06:57:56.231781+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 4 linked inside Pith

[1]

Christine Bauer, Eva Zangerle, and Alan Said. 2024. Exploring the Landscape of Recommender Systems Evaluation: Practices and Perspectives.ACM Trans. Recomm. Syst.2, 1, Article 11 (March 2024), 31 pages

2024
[2]

Joeran Beel, Marcel Genzmehr, Stefan Langer, Andreas Nürnberger, and Bela Gipp
[3]

InProceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (Hong Kong, China)(RepSys ’13)

A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. InProceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (Hong Kong, China)(RepSys ’13). Association for Computing Machinery, New York, NY, USA, 7–14
[4]

Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems.Inf. Retr.23, 4 (March 2020), 387–410

2020
[5]

Pablo Castells and Alistair Moffat. 2022. Offline recommender system evaluation: Challenges and new directions.AI Magazine43, 2 (2022), 225–238

2022
[6]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu
[7]

arXiv:2402.03216

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216

Pith/arXiv arXiv
[8]

Lei Chen, Le Wu, Richang Hong, Kun Zhang, and Meng Wang. 2020. Revisiting Graph based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach. arXiv:2001.10167

arXiv 2020
[9]

Xu Chen, Yongfeng Zhang, and Ji-Rong Wen. 2022. Measuring "Why" in Rec- ommender Systems: a Comprehensive Survey on the Evaluation of Explainable Recommendation. arXiv:2202.06466

arXiv 2022
[10]

DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437

Pith/arXiv arXiv 2025
[11]

Francesco Fabbri, Gustavo Penha, Edoardo D’Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stål, and Mounia Lalmas
[12]

InProceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25)

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge. InProceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25). Association for Computing Machinery, New York, NY, USA, 1181–1186
[13]

Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Language Models for Relevance Judgment. InProceedings of the 2023 ACM SI- GIR International Conference on Theory of Information Retrieva...

2023
[14]

Wenqi Fan, Xiaorui Liu, Wei Jin, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2022. Graph Trend Filtering Networks for Recommendation. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval(Madrid, Spain)(SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 112–121

2022
[15]

Leticia Freire de Figueiredo, Antonio A. de A. Rocha, and Aline Paes. 2025. Tell me why: how Explanation can affect Recommender Systems. InProceedings of the 2025 ACM International Conference on Interactive Media Experiences (IMX ’25). Association for Computing Machinery, New York, NY, USA, 492–493

2025
[16]

Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems. InProceedings of the 31st ACM International Conference on Information & Knowledge Management (Atlanta, GA, USA)(CIKM ’22). Association for Computing M...

2022
[17]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as- a-Judge. arXiv:2411.15594

Pith/arXiv arXiv 2025
[18]

Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. InProceedings of the 43rd International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New Yor...

2020
[19]

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. InProceedings of the 26th International Conference on World Wide Web(Perth, Australia)(WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 173–182

2017
[20]

Herlocker, Joseph A

Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl
[21]

Evaluating collaborative filtering recommender systems.ACM Trans. Inf. KDD 2026, August 9–13, 2026, Jeju Island, Republic of Korea. Yue Que et al. Syst.22, 1 (Jan. 2004), 5–53

2026
[22]

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. InProceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM ’08). IEEE Computer Society, USA, 263–272

2008
[23]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Trans. Inf. Syst.43, 2, Article 42 (Jan. 2025), 55 pages

2025
[24]

Ikotun, Absalom E

Abiodun M. Ikotun, Absalom E. Ezugwu, Laith Abualigah, Belal Abuhaija, and Jia Heming. 2023. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data.Information Sciences622 (2023), 178–210

2023
[25]

Jadidinejad, Craig Macdonald, and Iadh Ounis

Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2020. Using Exploration to Alleviate Closed Loop Effects in Recommender Systems. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval(Virtual Event, China)(SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 2025–2028

2020
[26]

Jadidinejad, Craig Macdonald, and Iadh Ounis

Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2021. The Simpson’s Paradox in the Offline Evaluation of Recommendation Systems.ACM Trans. Inf. Syst.40, 1, Article 4 (Sept. 2021), 22 pages

2021
[27]

Olivier Jeunen and Aleksei Ustimenko. 2024. Δ-OPE: Off-Policy Estimation with Pairs of Policies. InProceedings of the 18th ACM Conference on Recommender Systems(Bari, Italy)(RecSys ’24). Association for Computing Machinery, New York, NY, USA, 878–883

2024
[28]

Petr Kasalický, Rodrigo Alves, and Pavel Kordík. 2023. Bridging Offline-Online Evaluation with a Time-dependent and Popularity Bias-free Offline Metric for Recommenders. arXiv:2308.06885

arXiv 2023
[29]

Seyedeh Baharan Khatami, Sayan Chakraborty, Ruomeng Xu, and Babak Salimi
[30]

arXiv:2504.03997

Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems. arXiv:2504.03997

arXiv
[31]

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Tech- niques for Recommender Systems.Computer42, 8 (2009), 30–37

2009
[32]

Marlin and Richard S

Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. InProceedings of the Third ACM Confer- ence on Recommender Systems(New York, New York, USA)(RecSys ’09). Associa- tion for Computing Machinery, New York, NY, USA, 5–12

2009
[33]

Yusuke Narita, Shota Yasui, and Kohei Yata. 2021. Debiased Off-Policy Evaluation for Recommendation Systems. InProceedings of the 15th ACM Conference on Recommender Systems(Amsterdam, Netherlands)(RecSys ’21). Association for Computing Machinery, New York, NY, USA, 372–379

2021
[34]

Yue Que, Yingyi Zhang, Xiangyu Zhao, and Chen Ma. 2025. Causality-aware Graph Aggregation Weight Estimator for Popularity Debiasing in Top-K Recom- mendation. InProceedings of the 34th ACM International Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25). Association for Computing Machinery, New York, NY, USA, 2471–2481

2025
[35]

Webb (Eds.)

Claude Sammut and Geoffrey I. Webb (Eds.). 2010.Holdout Evaluation. Springer US, Boston, MA, 506–507

2010
[36]

Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: debiasing learning and evaluation. InProceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48(New York, NY, USA)(ICML’16). JMLR.org, 1670–1679

2016
[37]

Jianing Sun, Yingxue Zhang, Chen Ma, Mark Coates, Huifeng Guo, Ruiming Tang, and Xiuqiang He. 2019. Multi-graph Convolution Collaborative Filtering. In2019 IEEE International Conference on Data Mining (ICDM). 1306–1311

2019
[38]

Yi-Da Tang, Er-Dan Dong, and Wen Gao. 2024. LLMs in medicine: The need for advanced evaluation systems for disruptive technologies.The Innovation5, 3 (2024)

2024
[39]

Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Language Models can Accurately Predict Searcher Preferences. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Infor- mation Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1930–1940

2024
[40]

Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, and Koki Karube
[41]

InProceedings of the 34th ACM Interna- tional Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25)

A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models. InProceedings of the 34th ACM Interna- tional Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25). Association for Computing Machinery, New York, NY, USA, 5294–5298
[42]

Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. InProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France)(SIGIR’19). Association for Computing Machinery, New York, NY, USA, 165–174

2019
[43]

Yilei Wang, Jiabao Zhao, Deniz Ones, Liang He, and Xin Xu. 2025. Evaluating the ability of large language models to emulate personality.Scientific Reports15 (01 2025)

2025
[44]

Timo Wilm and Philipp Normann. 2025. Identifying Offline Metrics that Predict Online Impact: A Pragmatic Strategy for Real-World Recommender Systems. In Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25). Association for Computing Machinery, New York, NY, USA, 967–970

2025
[45]

Zhuo Wu, Qinglin Jia, Chuhan Wu, Zhaocheng Du, Shuai Wang, Zan Wang, and Zhenhua Dong. 2024. RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models. arXiv:2412.11068

arXiv 2024
[46]

An Yang, Anfeng Li, Baosong Yang, et al . 2025. Qwen3 Technical Report. arXiv:2505.09388

Pith/arXiv arXiv 2025
[47]

Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Debo- rah Estrin. 2018. Unbiased offline recommender evaluation for missing-not-at- random implicit feedback. InProceedings of the 12th ACM Conference on Recom- mender Systems(Vancouver, British Columbia, Canada)(RecSys ’18). Association for Computing Machinery, New York, NY, USA, 279–287

2018
[48]

Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. 2022. Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval(Madrid, Spain)(SIGIR ’22). Association for Computing Machinery,...

2022
[49]

Eva Zangerle and Christine Bauer. 2022. Evaluating Recommender Systems: Sur- vey and Framework.ACM Comput. Surv.55, 8, Article 170 (Dec. 2022), 38 pages

2022
[50]

Xiaoyu Zhang, Yishan Li, Jiayin Wang, Bowen Sun, Weizhi Ma, Peijie Sun, and Min Zhang. 2024. Large Language Models as Evaluators for Recommendation Explanations. InProceedings of the 18th ACM Conference on Recommender Systems (Bari, Italy)(RecSys ’24). Association for Computing Machinery, New York, NY, USA, 33–42

2024
[51]

Xiaokun Zhang, Bo Xu, Chenliang Li, Bowei He, Hongfei Lin, Chen Ma, and Fenglong Ma. 2025. A Survey on Side Information-Driven Session-Based Recom- mendation: From a Data-Centric Perspective.IEEE Transactions on Knowledge and Data Engineering37, 8 (2025), 4411–4431

2025
[52]

Xiaokun Zhang, Bo Xu, Zhaochun Ren, Xiaochen Wang, Hongfei Lin, and Fen- glong Ma. 2024. Disentangling ID and Modality Effects for Session-based Rec- ommendation. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, N...

2024
[53]

Xiaokun Zhang, Bo Xu, Youlin Wu, Yuan Zhong, Hongfei Lin, and Fenglong Ma. 2024. FineRec: Exploring Fine-grained Sequential Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1599–1608

2024
[54]

Discovered by the little cutie

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, Vol. 36. Curran Associates, Inc., 46595–46623. A Additional Detail...

2023

[1] [1]

Christine Bauer, Eva Zangerle, and Alan Said. 2024. Exploring the Landscape of Recommender Systems Evaluation: Practices and Perspectives.ACM Trans. Recomm. Syst.2, 1, Article 11 (March 2024), 31 pages

2024

[2] [2]

Joeran Beel, Marcel Genzmehr, Stefan Langer, Andreas Nürnberger, and Bela Gipp

[3] [3]

InProceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (Hong Kong, China)(RepSys ’13)

A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. InProceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (Hong Kong, China)(RepSys ’13). Association for Computing Machinery, New York, NY, USA, 7–14

[4] [4]

Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems.Inf. Retr.23, 4 (March 2020), 387–410

2020

[5] [5]

Pablo Castells and Alistair Moffat. 2022. Offline recommender system evaluation: Challenges and new directions.AI Magazine43, 2 (2022), 225–238

2022

[6] [6]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu

[7] [7]

arXiv:2402.03216

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216

Pith/arXiv arXiv

[8] [8]

Lei Chen, Le Wu, Richang Hong, Kun Zhang, and Meng Wang. 2020. Revisiting Graph based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach. arXiv:2001.10167

arXiv 2020

[9] [9]

Xu Chen, Yongfeng Zhang, and Ji-Rong Wen. 2022. Measuring "Why" in Rec- ommender Systems: a Comprehensive Survey on the Evaluation of Explainable Recommendation. arXiv:2202.06466

arXiv 2022

[10] [10]

DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437

Pith/arXiv arXiv 2025

[11] [11]

Francesco Fabbri, Gustavo Penha, Edoardo D’Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stål, and Mounia Lalmas

[12] [12]

InProceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25)

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge. InProceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25). Association for Computing Machinery, New York, NY, USA, 1181–1186

[13] [13]

Guglielmo Faggioli, Laura Dietz, Charles L. A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, and Henning Wachsmuth. 2023. Perspectives on Large Language Models for Relevance Judgment. InProceedings of the 2023 ACM SI- GIR International Conference on Theory of Information Retrieva...

2023

[14] [14]

Wenqi Fan, Xiaorui Liu, Wei Jin, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2022. Graph Trend Filtering Networks for Recommendation. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval(Madrid, Spain)(SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 112–121

2022

[15] [15]

Leticia Freire de Figueiredo, Antonio A. de A. Rocha, and Aline Paes. 2025. Tell me why: how Explanation can affect Recommender Systems. InProceedings of the 2025 ACM International Conference on Interactive Media Experiences (IMX ’25). Association for Computing Machinery, New York, NY, USA, 492–493

2025

[16] [16]

Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. 2022. KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems. InProceedings of the 31st ACM International Conference on Information & Knowledge Management (Atlanta, GA, USA)(CIKM ’22). Association for Computing M...

2022

[17] [17]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as- a-Judge. arXiv:2411.15594

Pith/arXiv arXiv 2025

[18] [18]

Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. InProceedings of the 43rd International ACM SIGIR Confer- ence on Research and Development in Information Retrieval(Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New Yor...

2020

[19] [19]

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. InProceedings of the 26th International Conference on World Wide Web(Perth, Australia)(WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 173–182

2017

[20] [20]

Herlocker, Joseph A

Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl

[21] [21]

Evaluating collaborative filtering recommender systems.ACM Trans. Inf. KDD 2026, August 9–13, 2026, Jeju Island, Republic of Korea. Yue Que et al. Syst.22, 1 (Jan. 2004), 5–53

2026

[22] [22]

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. InProceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM ’08). IEEE Computer Society, USA, 263–272

2008

[23] [23]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Trans. Inf. Syst.43, 2, Article 42 (Jan. 2025), 55 pages

2025

[24] [24]

Ikotun, Absalom E

Abiodun M. Ikotun, Absalom E. Ezugwu, Laith Abualigah, Belal Abuhaija, and Jia Heming. 2023. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data.Information Sciences622 (2023), 178–210

2023

[25] [25]

Jadidinejad, Craig Macdonald, and Iadh Ounis

Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2020. Using Exploration to Alleviate Closed Loop Effects in Recommender Systems. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval(Virtual Event, China)(SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 2025–2028

2020

[26] [26]

Jadidinejad, Craig Macdonald, and Iadh Ounis

Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2021. The Simpson’s Paradox in the Offline Evaluation of Recommendation Systems.ACM Trans. Inf. Syst.40, 1, Article 4 (Sept. 2021), 22 pages

2021

[27] [27]

Olivier Jeunen and Aleksei Ustimenko. 2024. Δ-OPE: Off-Policy Estimation with Pairs of Policies. InProceedings of the 18th ACM Conference on Recommender Systems(Bari, Italy)(RecSys ’24). Association for Computing Machinery, New York, NY, USA, 878–883

2024

[28] [28]

Petr Kasalický, Rodrigo Alves, and Pavel Kordík. 2023. Bridging Offline-Online Evaluation with a Time-dependent and Popularity Bias-free Offline Metric for Recommenders. arXiv:2308.06885

arXiv 2023

[29] [29]

Seyedeh Baharan Khatami, Sayan Chakraborty, Ruomeng Xu, and Babak Salimi

[30] [30]

arXiv:2504.03997

Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems. arXiv:2504.03997

arXiv

[31] [31]

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Tech- niques for Recommender Systems.Computer42, 8 (2009), 30–37

2009

[32] [32]

Marlin and Richard S

Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. InProceedings of the Third ACM Confer- ence on Recommender Systems(New York, New York, USA)(RecSys ’09). Associa- tion for Computing Machinery, New York, NY, USA, 5–12

2009

[33] [33]

Yusuke Narita, Shota Yasui, and Kohei Yata. 2021. Debiased Off-Policy Evaluation for Recommendation Systems. InProceedings of the 15th ACM Conference on Recommender Systems(Amsterdam, Netherlands)(RecSys ’21). Association for Computing Machinery, New York, NY, USA, 372–379

2021

[34] [34]

Yue Que, Yingyi Zhang, Xiangyu Zhao, and Chen Ma. 2025. Causality-aware Graph Aggregation Weight Estimator for Popularity Debiasing in Top-K Recom- mendation. InProceedings of the 34th ACM International Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25). Association for Computing Machinery, New York, NY, USA, 2471–2481

2025

[35] [35]

Webb (Eds.)

Claude Sammut and Geoffrey I. Webb (Eds.). 2010.Holdout Evaluation. Springer US, Boston, MA, 506–507

2010

[36] [36]

Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: debiasing learning and evaluation. InProceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48(New York, NY, USA)(ICML’16). JMLR.org, 1670–1679

2016

[37] [37]

Jianing Sun, Yingxue Zhang, Chen Ma, Mark Coates, Huifeng Guo, Ruiming Tang, and Xiuqiang He. 2019. Multi-graph Convolution Collaborative Filtering. In2019 IEEE International Conference on Data Mining (ICDM). 1306–1311

2019

[38] [38]

Yi-Da Tang, Er-Dan Dong, and Wen Gao. 2024. LLMs in medicine: The need for advanced evaluation systems for disruptive technologies.The Innovation5, 3 (2024)

2024

[39] [39]

Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Language Models can Accurately Predict Searcher Preferences. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Infor- mation Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1930–1940

2024

[40] [40]

Yu Tokutake, Kazushi Okamoto, Kei Harada, Atsushi Shibata, and Koki Karube

[41] [41]

InProceedings of the 34th ACM Interna- tional Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25)

A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models. InProceedings of the 34th ACM Interna- tional Conference on Information and Knowledge Management(Seoul, Republic of Korea)(CIKM ’25). Association for Computing Machinery, New York, NY, USA, 5294–5298

[42] [42]

Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. InProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France)(SIGIR’19). Association for Computing Machinery, New York, NY, USA, 165–174

2019

[43] [43]

Yilei Wang, Jiabao Zhao, Deniz Ones, Liang He, and Xin Xu. 2025. Evaluating the ability of large language models to emulate personality.Scientific Reports15 (01 2025)

2025

[44] [44]

Timo Wilm and Philipp Normann. 2025. Identifying Offline Metrics that Predict Online Impact: A Pragmatic Strategy for Real-World Recommender Systems. In Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25). Association for Computing Machinery, New York, NY, USA, 967–970

2025

[45] [45]

Zhuo Wu, Qinglin Jia, Chuhan Wu, Zhaocheng Du, Shuai Wang, Zan Wang, and Zhenhua Dong. 2024. RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models. arXiv:2412.11068

arXiv 2024

[46] [46]

An Yang, Anfeng Li, Baosong Yang, et al . 2025. Qwen3 Technical Report. arXiv:2505.09388

Pith/arXiv arXiv 2025

[47] [47]

Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Debo- rah Estrin. 2018. Unbiased offline recommender evaluation for missing-not-at- random implicit feedback. InProceedings of the 12th ACM Conference on Recom- mender Systems(Vancouver, British Columbia, Canada)(RecSys ’18). Association for Computing Machinery, New York, NY, USA, 279–287

2018

[48] [48]

Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. 2022. Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval(Madrid, Spain)(SIGIR ’22). Association for Computing Machinery,...

2022

[49] [49]

Eva Zangerle and Christine Bauer. 2022. Evaluating Recommender Systems: Sur- vey and Framework.ACM Comput. Surv.55, 8, Article 170 (Dec. 2022), 38 pages

2022

[50] [50]

Xiaoyu Zhang, Yishan Li, Jiayin Wang, Bowen Sun, Weizhi Ma, Peijie Sun, and Min Zhang. 2024. Large Language Models as Evaluators for Recommendation Explanations. InProceedings of the 18th ACM Conference on Recommender Systems (Bari, Italy)(RecSys ’24). Association for Computing Machinery, New York, NY, USA, 33–42

2024

[51] [51]

Xiaokun Zhang, Bo Xu, Chenliang Li, Bowei He, Hongfei Lin, Chen Ma, and Fenglong Ma. 2025. A Survey on Side Information-Driven Session-Based Recom- mendation: From a Data-Centric Perspective.IEEE Transactions on Knowledge and Data Engineering37, 8 (2025), 4411–4431

2025

[52] [52]

Xiaokun Zhang, Bo Xu, Zhaochun Ren, Xiaochen Wang, Hongfei Lin, and Fen- glong Ma. 2024. Disentangling ID and Modality Effects for Session-based Rec- ommendation. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, N...

2024

[53] [53]

Xiaokun Zhang, Bo Xu, Youlin Wu, Yuan Zhong, Hongfei Lin, and Fenglong Ma. 2024. FineRec: Exploring Fine-grained Sequential Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1599–1608

2024

[54] [54]

Discovered by the little cutie

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, Vol. 36. Curran Associates, Inc., 46595–46623. A Additional Detail...

2023