Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
Pith reviewed 2026-05-13 17:24 UTC · model grok-4.3
The pith
Ranking pre-extracted statements from reviews avoids LLM hallucination by construction, and popularity baselines turn out to beat state-of-the-art models in item-level explanation ranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that explainable recommendation can be formalized as ranking pre-extracted candidate statements at the statement level, which by construction eliminates hallucination risks and permits the use of standard ranking metrics for evaluation. Using an LLM pipeline for extraction and semantic clustering for uniqueness, they construct the StaR benchmark across four Amazon categories. Experiments show popularity baselines are competitive globally but outperform state-of-the-art models in item-level ranking, indicating that current personalized explanation models have critical limitations.
What carries the argument
The statement-level ranking formulation, where systems rank candidate explanatory statements derived from reviews and return the top-k as the explanation.
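The formulation can be sketched in a few lines. This is an illustrative reading of the task, not the paper's code: the scoring function is a placeholder for whatever relevance model a system uses.

```python
# Hypothetical sketch of the statement-level ranking formulation:
# given a user-item pair, score each candidate statement and return
# the top-k as the explanation. The scoring function is a stand-in.
from typing import Callable, List


def explain(user: str, item: str,
            candidates: List[str],
            score: Callable[[str, str, str], float],
            k: int = 3) -> List[str]:
    """Rank candidate statements by relevance and return the top-k."""
    ranked = sorted(candidates, key=lambda s: score(user, item, s), reverse=True)
    return ranked[:k]


# Toy relevance function for illustration only: longer statements score higher.
toy_score = lambda u, i, s: float(len(s))
explain("u1", "i1", ["good battery", "sturdy build quality", "cheap"], toy_score, k=2)
```

Because the output is always a subset of the pre-extracted candidates, every explanation is traceable to source text by construction.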
If this is right
- Explanations become directly traceable to source review text, eliminating fabricated details.
- Evaluation shifts to reproducible ranking metrics such as NDCG instead of subjective generation scores.
- Models must learn to identify relevant item-specific statements rather than produce free-form text.
- Popularity serves as a strong baseline that any personalized ranking approach must demonstrably surpass in item-level settings.
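The evaluation loop implied by these points can be sketched as a popularity baseline scored with binary-relevance NDCG@k. This is a minimal sketch under the assumption that relevance labels mark which candidate statements a held-out user review supports; all names are illustrative, not the paper's implementation.

```python
# Popularity baseline + binary NDCG@k, the kind of setup the paper's
# global- and item-level experiments use. Illustrative sketch only.
import math
from collections import Counter
from typing import List, Set


def popularity_rank(train_statements: List[str], candidates: List[str]) -> List[str]:
    """Rank candidates by how often each statement appears in training data."""
    counts = Counter(train_statements)
    return sorted(candidates, key=lambda s: counts[s], reverse=True)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain over the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, s in enumerate(ranked[:k]) if s in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

A personalized model only earns its complexity if its NDCG@k beats `popularity_rank` on the item-level split, which is precisely the comparison the paper reports popularity winning on average.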
Where Pith is reading between the lines
- Current neural models may be capturing global statement popularity rather than learning true user-item relevance signals.
- Grounding explanations in existing reviews could generalize to other generation tasks where factual fidelity matters.
- Item-level evaluation highlights that personalization for explanations is harder than for item recommendations themselves.
Load-bearing premise
The LLM-based extraction pipeline reliably produces statements that are explanatory, atomic, and unique from noisy reviews.
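The uniqueness half of this premise rests on semantic clustering. A minimal sketch of greedy paraphrase consolidation follows, with a toy bag-of-words embedding standing in for a real sentence encoder; the threshold and helper names are assumptions, not the paper's method.

```python
# Greedy paraphrase consolidation: statements whose embeddings exceed a
# cosine-similarity threshold are merged under one canonical statement.
# The bag-of-words "embedding" is a toy stand-in for a sentence encoder.
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def consolidate(statements: List[str], threshold: float = 0.6) -> Dict[str, List[str]]:
    """Assign each statement to the first cluster whose representative is
    similar enough; otherwise open a new cluster keyed by the statement."""
    clusters: Dict[str, List[str]] = {}
    for s in statements:
        for rep in clusters:
            if cosine(embed(s), embed(rep)) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters
```

If the encoder or threshold misfires, near-duplicates survive as separate candidates and inflate popularity counts, which is one concrete way extraction artifacts could contaminate the downstream ranking results.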
What would settle it
Either a large-scale human evaluation showing that many pipeline-extracted statements are not factual or atomic (undermining the benchmark), or a new ranking model achieving substantially higher item-level metrics than popularity baselines on the StaR benchmark.
Original abstract
Textual explanations, generated with large language models (LLMs), are increasingly used to justify recommendations. Yet, evaluating these explanations remains a critical challenge. We advocate a shift in objective: rank, don't generate. We formalize explainable recommendation as a statement-level ranking problem, where systems rank candidate explanatory statements derived from reviews and return the top-k as explanation. This formulation mitigates hallucination by construction and enables fine-grained factual analysis. It also models factor importance through relevance scores and supports standardized, reproducible evaluation with established ranking metrics. Meaningful assessment, however, requires each statement to be explanatory (item facts affecting user experience), atomic (one opinion about one aspect), and unique (paraphrases consolidated), which is challenging to obtain from noisy reviews. We address this with (i) an LLM-based extraction pipeline producing explanatory and atomic statements, and (ii) a scalable, semantic clustering method consolidating paraphrases to enforce uniqueness. Building on this pipeline, we introduce StaR, a benchmark for statement ranking in explainable recommendation, constructed from four Amazon Reviews 2014 product categories. We evaluate popularity-based baselines and state-of-the-art models under global-level (all statements) and item-level (target item statements) ranking. Popularity baselines are competitive in global-level ranking but outperform state-of-the-art models on average in item-level ranking, exposing critical limitations in personalized explanation ranking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes reformulating explainable recommendation as a statement-level ranking task instead of generation. It describes an LLM-based pipeline to extract explanatory and atomic statements from reviews, combined with semantic clustering to consolidate paraphrases for uniqueness. This pipeline is used to construct the StaR benchmark from Amazon Reviews 2014 data in four product categories. Experiments evaluate popularity baselines against state-of-the-art models under global-level (all statements) and item-level (per-item statements) ranking, reporting that popularity baselines are competitive globally but outperform SOTA models on average in item-level ranking, which the authors interpret as exposing limitations in personalized explanation ranking.
Significance. If the extracted statements prove reliable, the work offers a new benchmark and evaluation framework that avoids hallucination by construction and supports standard ranking metrics. The item-level finding that simple popularity baselines can surpass complex models would be a substantive result, potentially redirecting research toward stronger personalization mechanisms for explanations. The pipeline and clustering approach could also serve as a reusable artifact for future statement-ranking studies.
major comments (2)
- [§3] §3 (LLM extraction pipeline): The validity of all downstream claims, including the StaR benchmark and the item-level outperformance result, rests on the pipeline producing statements that are explanatory (item facts affecting user experience), atomic (one opinion per aspect), and unique. No human evaluation, error rates, inter-annotator agreement, or qualitative analysis of output quality is reported. This is a load-bearing gap; without it, the reported superiority of popularity baselines may measure extraction artifacts rather than explanatory ranking quality.
- [§5.2] §5.2 and Table 2 (item-level results): The headline claim that popularity baselines outperform SOTA models on average in item-level ranking is presented as evidence of 'critical limitations' in personalized ranking. This interpretation is only supported if the StaR statements are verifiably high-quality; the absence of pipeline validation makes the conclusion unsupported at present.
minor comments (2)
- [Abstract / §4] The abstract and §4 state that StaR is built from 'four Amazon Reviews 2014 product categories' but do not name them; listing the categories (e.g., Electronics, Clothing) would improve reproducibility.
- [§2] Notation for relevance scores and ranking metrics is introduced without an explicit table of symbols; a short notation table would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies a key area for strengthening the manuscript. We agree that explicit human validation of the LLM extraction pipeline is necessary to support the reliability of the StaR benchmark and the item-level experimental conclusions. We will revise the paper to include such validation and adjust result interpretations accordingly. Point-by-point responses to the major comments are provided below.
Point-by-point responses
Referee: [§3] §3 (LLM extraction pipeline): The validity of all downstream claims, including the StaR benchmark and the item-level outperformance result, rests on the pipeline producing statements that are explanatory (item facts affecting user experience), atomic (one opinion per aspect), and unique. No human evaluation, error rates, inter-annotator agreement, or qualitative analysis of output quality is reported. This is a load-bearing gap; without it, the reported superiority of popularity baselines may measure extraction artifacts rather than explanatory ranking quality.
Authors: We agree that the lack of human evaluation for the pipeline constitutes a significant gap, as the downstream claims depend on statement quality. Although the pipeline uses targeted prompts and semantic clustering to promote explanatory, atomic, and unique outputs, no quantitative human assessment (such as inter-annotator agreement or error rates) was included in the original submission. In the revised manuscript, we will add a new subsection to §3 reporting a human evaluation study on a sampled set of statements, including agreement metrics, precision for explanatory and atomic properties, and qualitative examples. This will directly mitigate the risk that results reflect extraction artifacts. revision: yes
Referee: [§5.2] §5.2 and Table 2 (item-level results): The headline claim that popularity baselines outperform SOTA models on average in item-level ranking is presented as evidence of 'critical limitations' in personalized ranking. This interpretation is only supported if the StaR statements are verifiably high-quality; the absence of pipeline validation makes the conclusion unsupported at present.
Authors: We concur that the interpretation of item-level outperformance as evidence of critical limitations in personalized ranking requires verified statement quality. The current manuscript presents the result without this validation, which weakens the claim. In revision, we will incorporate the human evaluation results from the updated §3 into §5.2, explicitly tying benchmark quality to the findings. We will qualify the language on 'critical limitations' to reflect the supporting evidence and, if needed, present the result more cautiously as highlighting the need for improved personalization mechanisms. revision: yes
Circularity Check
No significant circularity; derivation chain is self-contained
full rationale
The paper formalizes explainable recommendation as statement-level ranking, introduces an LLM extraction pipeline plus semantic clustering to build the StaR benchmark from Amazon reviews, and evaluates baselines versus SOTA models with standard ranking metrics on global and item-level tasks. No equations, fitted parameters, or self-citations reduce the central claim (popularity outperformance on item-level ranking) to prior inputs by construction. The benchmark and pipeline are presented as independent new artifacts; the evaluation relies on established metrics rather than any self-referential loop. This is the normal case of a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: statements extracted from reviews can be made explanatory, atomic, and unique.
invented entities (1)
- StaR benchmark (no independent evidence)