On the Factual Consistency of Text-based Explainable Recommendation Models

Ben Kabongo; Vincent Guigue

arxiv: 2512.24366 · v2 · pith:6SRHLKGRnew · submitted 2025-12-30 · 💻 cs.IR

Pith reviewed 2026-05-21 15:41 UTC · model grok-4.3

classification 💻 cs.IR

keywords factual consistency · explainable recommendation · text-based explanations · LLM-based evaluation · Amazon reviews · statement-level metrics · BERTScore

The pith

Text-based explainable recommendation models achieve high semantic similarity to reviews yet show very low factual consistency with them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evaluation framework that uses large language models to pull atomic factual statements out of user reviews and then checks whether generated explanations actually match those statements. It applies the approach to five Amazon review categories and six current models, producing new benchmarks for fine-grained checks. The experiments find strong BERTScore results alongside statement-level precision scores between roughly 4 and 33 percent. This gap matters because explanations are intended to increase trust and transparency; when they stray from the evidence they can instead mislead users. The work supplies both the metrics and the test sets needed to measure and improve factual reliability in future systems.

Core claim

We introduce a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews and thereby construct a ground truth focused on factual content. We apply the pipeline to five categories from the Amazon Reviews dataset to create augmented benchmarks and propose statement-level alignment metrics that combine LLM and NLI methods. Experiments across six state-of-the-art explainable recommendation models show high semantic similarity (BERTScore F1 0.81-0.90) but low factual performance (LLM-based statement-level precision 4.38%-32.88%).
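As a rough illustration of what statement-level precision and recall could look like, the sketch below scores a generated explanation against reference statements. It is one plausible reading, not the paper's implementation: the function names, the `Judge` signature, and the example statements are all hypothetical, and `toy_judge` is a simple exact-match placeholder standing in for the LLM- or NLI-based judgment the paper describes.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (claim, evidence statements) -> supported or not.
Judge = Callable[[str, Sequence[str]], bool]


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def toy_judge(claim: str, evidence: Sequence[str]) -> bool:
    """Placeholder judge: exact match after normalization.

    A real setup would call an LLM or an NLI model here; this stand-in
    only keeps the example runnable.
    """
    return _normalize(claim) in {_normalize(e) for e in evidence}


def statement_precision(generated: Sequence[str], reference: Sequence[str],
                        judge: Judge = toy_judge) -> float:
    """Share of generated statements supported by the reference statements."""
    if not generated:
        return 0.0
    return sum(judge(g, reference) for g in generated) / len(generated)


def statement_recall(generated: Sequence[str], reference: Sequence[str],
                     judge: Judge = toy_judge) -> float:
    """Share of reference statements covered by the generated statements."""
    if not reference:
        return 0.0
    return sum(judge(r, generated) for r in reference) / len(reference)


# Hypothetical statements, not taken from the paper's benchmarks.
reference = ["battery lasts two days", "screen is bright", "case feels cheap"]
generated = ["Screen is bright", "camera is excellent"]

precision = statement_precision(generated, reference)
recall = statement_recall(generated, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.33
```

Keeping the judge as a pluggable callable means the same scoring code could be reused with different LLM or NLI back ends.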

What carries the argument

A prompting-based LLM pipeline that extracts atomic explanatory statements from reviews to build factual ground truth for alignment evaluation.

If this is right

Semantic similarity scores alone are insufficient for judging explanation quality in recommendation systems.
New training objectives or decoding strategies are needed to raise factual consistency without sacrificing fluency.
The created Amazon-derived benchmarks enable repeatable fine-grained testing of both consistency and relevance.
Factuality-aware metrics should be added to standard evaluation suites for text-based explainable recommenders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Persistent low factuality may gradually reduce user willingness to rely on explanations even when recommendations themselves remain accurate.
The same extraction-and-alignment approach could be adapted to check generated text in other high-stakes domains such as health or finance advice.
Models might benefit from explicit fact-verification modules that cross-check candidate explanations against the original review before output.

Load-bearing premise

The LLM pipeline that extracts atomic explanatory statements from reviews produces reliable and complete ground truth without its own errors or omissions.

What would settle it

Re-running the statement extraction step with several different large language models and checking whether the resulting factuality precision scores stay consistently low or change substantially.

Figures

Figures reproduced from arXiv: 2512.24366 by Ben Kabongo, Vincent Guigue.

**Figure 1.** BERTScore F1 vs. LLM-based statement-level metrics: Precision (top) and Recall (bottom). Pearson correlation (r) reported in brackets. Among various explanation paradigms, text-based explainable recommendation has emerged as a particularly promising approach, leveraging the flexibility and expressiveness of natural language to convey personalized rationales [3, 15–18, 20]. Recent advances have increasingl…

**Figure 2.** Text Similarity Results on explanation generation (top) and review gen…
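Figure 1 reports a Pearson correlation (r) between BERTScore F1 and the statement-level metrics. For readers who want to run the same kind of check on their own scores, here is a minimal sketch of the coefficient in plain Python. The function name and the input values are hypothetical and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores for illustration only.
bertscore_f1 = [0.81, 0.84, 0.86, 0.88, 0.90]
statement_precision_pct = [9.0, 6.5, 21.0, 12.0, 15.5]

r = pearson_r(bertscore_f1, statement_precision_pct)
print(f"Pearson r = {r:.2f}")
```

A value of r close to zero would indicate that semantic similarity tells little about statement-level factuality, which is the kind of relationship the figure is meant to expose.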

Original abstract

Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, to improve user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates and focuses on their factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations. Across extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a framework for assessing factual consistency in text-based explainable recommendation models. It employs an LLM prompting pipeline to extract atomic explanatory statements from reviews across five Amazon Reviews categories, forming ground-truth benchmarks, and evaluates six state-of-the-art models using combined LLM- and NLI-based statement-level alignment metrics. The central empirical finding is a pronounced gap: high semantic similarity (BERTScore F1 of 0.81–0.90) contrasted with low factual consistency (LLM-based statement-level precision of 4.38%–32.88%).

Significance. If the ground-truth construction and metrics prove reliable, the work usefully demonstrates that semantic overlap alone does not guarantee factual consistency in generated explanations, with direct implications for user trust in recommender systems. The augmented benchmarks and dual-metric approach constitute a concrete contribution to evaluation methodology in explainable recommendation.

major comments (2)

[prompting-based pipeline and ground-truth construction] The central claim of a large semantic-versus-factual gap rests on the LLM-based extraction pipeline producing a reliable, complete ground truth of atomic factual statements. However, the manuscript reports no human validation, inter-annotator agreement, or error analysis for this extraction step (described in the prompting-based pipeline and benchmark construction). Without such checks, systematic omissions, hallucinations, or inconsistent criteria across categories could artifactually depress the reported precision scores.
[statement-level alignment metrics] The statement-level alignment metrics (LLM- and NLI-based) are applied without reported controls for LLM judgment bias, prompt sensitivity, or inter-run variability. This is load-bearing because the low factuality numbers (4.38%–32.88%) are the primary evidence for the gap; any instability in the judge LLM would directly affect the strength of the conclusion.
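The inter-annotator agreement requested in the first major comment is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal plain-Python version for two annotators. The function name and the labels are hypothetical, and the paper itself reports no such study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined, and returning 1.0 is a convention chosen here.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on four extracted statements.
annotator_1 = ["correct", "correct", "incorrect", "incorrect"]
annotator_2 = ["correct", "incorrect", "incorrect", "incorrect"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # kappa = 0.50
```

In this toy case, raw agreement is 0.75 and chance agreement is 0.50, so kappa is 0.50.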

minor comments (2)

[Abstract] The abstract states results on 'five categories' but does not name them; listing the specific Amazon categories (e.g., Books, Electronics) would improve reproducibility.
[Experimental results] Results tables would benefit from reporting the number of extracted statements per category and per model to allow readers to gauge the scale of the evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that additional validation and robustness checks will strengthen the work and will incorporate them in the revision.

Referee: [prompting-based pipeline and ground-truth construction] The central claim of a large semantic-versus-factual gap rests on the LLM-based extraction pipeline producing a reliable, complete ground truth of atomic factual statements. However, the manuscript reports no human validation, inter-annotator agreement, or error analysis for this extraction step (described in the prompting-based pipeline and benchmark construction). Without such checks, systematic omissions, hallucinations, or inconsistent criteria across categories could artifactually depress the reported precision scores.

Authors: We acknowledge that the current manuscript does not report a formal human validation study, inter-annotator agreement, or systematic error analysis for the LLM-based extraction pipeline. Although the pipeline was iteratively refined with detailed prompts and we performed informal manual inspections of samples during development, these steps were not documented with quantitative agreement metrics. We agree this is a substantive limitation that could affect confidence in the ground-truth statements. In the revised manuscript we will add a dedicated subsection describing a human evaluation: two independent annotators will label a stratified random sample of 200 extracted statements (40 per category) for factual accuracy and completeness. We will report Cohen's kappa for inter-annotator agreement and include a qualitative error analysis of any observed omissions or inconsistencies. These additions will directly address the concern that the low precision scores might be artifacts of the extraction process. revision: yes
Referee: [statement-level alignment metrics] The statement-level alignment metrics (LLM- and NLI-based) are applied without reported controls for LLM judgment bias, prompt sensitivity, or inter-run variability. This is load-bearing because the low factuality numbers (4.38%–32.88%) are the primary evidence for the gap; any instability in the judge LLM would directly affect the strength of the conclusion.

Authors: We concur that the absence of explicit controls for judgment bias, prompt sensitivity, and run-to-run variability weakens the evidential strength of the reported factuality scores. While the NLI component provides a deterministic baseline, the LLM-based metric is indeed sensitive to prompt wording and model choice. To mitigate this, the revised version will include a new robustness subsection with three sets of experiments: (1) prompt sensitivity tests using three alternative prompt phrasings and reporting the range of precision scores; (2) inter-run variability measured over five independent runs with different random seeds, including standard deviations and confidence intervals; and (3) a cross-judge comparison using a second LLM (e.g., GPT-4o-mini) on a subset of the data. These results will be presented alongside the main metrics to demonstrate that the observed gap remains consistent despite moderate variability. revision: yes
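The run-to-run variability report described in the second response (mean, standard deviation, and a confidence interval over repeated runs) could be computed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, the five scores are invented for illustration, and the interval uses a Student t critical value supplied by the caller.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float],
                   t_crit: float) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample std, (ci_low, ci_high)) for repeated-run scores.

    t_crit is the two-sided Student t critical value for len(scores) - 1
    degrees of freedom at the desired confidence level.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical precision scores (%) from five runs; not results from the paper.
runs = [12.1, 11.8, 12.6, 12.0, 12.5]
T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value for 4 degrees of freedom

mean, std, (ci_low, ci_high) = summarize_runs(runs, T_CRIT_95_DF4)
print(f"mean={mean:.2f} std={std:.2f} 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

With only five runs the t-based interval is wide, so it is best read as a rough indication of stability.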

Circularity Check

0 steps flagged

No circularity: empirical evaluation framework is self-contained

The paper conducts an empirical evaluation of existing explainable recommendation models against a ground-truth constructed via an LLM prompting pipeline applied to an external public dataset (Amazon Reviews). No equations, fitted parameters, or self-referential definitions are present that would reduce the reported factuality scores (e.g., statement-level precision) to quantities defined inside the study itself. The semantic-vs-factual gap claim rests on independent comparisons using standard BERTScore, LLM-based, and NLI metrics rather than any derivation that collapses to the inputs by construction. Minor self-citations to prior recommendation work are not load-bearing for the central evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM extraction yields faithful atomic facts and that the proposed alignment metrics correctly quantify factual consistency; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: LLMs can reliably extract atomic factual statements from product reviews without systematic omission or hallucination.
This premise underpins the ground-truth construction step described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews... statement-level alignment metrics that combine LLM- and NLI-based approaches"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
cs.IR 2026-04 unverdicted novelty 7.0

The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)
[3] Dong, L., Huang, S., Wei, F., Lapata, M., Zhou, M., Xu, K.: Learning to generate product reviews from attributes. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pp. 623–632 (2017)
[4] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)
[5] Fu, J., Ng, S.K., Jiang, Z., Liu, P.: GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166 (2023)
[6] He, P., Liu, X., Gao, J., Chen, W.: Deberta: Decoding-enhanced bert with disentangled attention. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=XPZIaotutsD
[7] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., Wang, M.: Lightgcn: Simplifying and powering graph convolution network for recommendation. In: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. pp. 639–648 (2020)
[8] Herserant, T., Guigue, V.: Seval-ex: A statement-level framework for explainable summarization evaluation. arXiv preprint arXiv:2505.02235 (2025)
[9] Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kukliansy, D., Cohen, V., Scialom, T., Szpektor, I., Hassidim, A., Matias, Y.: True: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991 (2022)
[10] Huang, Y., Sun, L., Wang, H., Wu, S., Zhang, Q., Li, Y., Gao, C., Huang, Y., Lyu, W., Zhang, Y., et al.: TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024)
[11] Kabongo, B., Guigue, V., Lemberger, P.: Elixir: Efficient and lightweight model for explaining recommendations. arXiv preprint arXiv:2508.20312 (2025)
[12] Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)
[13] Laban, P., Schnabel, T., Bennett, P.N., Hearst, M.A.: Summac: Re-visiting nli-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics 10, 163–177 (2022)
[14] Li, L., Zhang, Y., Chen, L.: Generate neural template explanations for recommendation. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. pp. 755–764 (2020)
[15] Li, L., Zhang, Y., Chen, L.: Personalized transformer for explainable recommendation. arXiv preprint arXiv:2105.11601 (2021)
[16] Li, L., Zhang, Y., Chen, L.: Personalized prompt learning for explainable recommendation. ACM Transactions on Information Systems 41(4), 1–26 (2023)
[17] Li, P., Wang, Z., Ren, Z., Bing, L., Lam, W.: Neural rating regression with abstractive tips generation for recommendation. In: Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. pp. 345–354 (2017)
[18] Li, Y., Zhang, X., Luo, L., Chang, H., Ren, Y., King, I., Li, J.: G-refer: Graph retrieval-augmented large language model for explainable recommendation. In: Proceedings of the ACM on Web Conference 2025. pp. 240–251 (2025)
[19] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
[20] Ma, Q., Ren, X., Huang, C.: Xrec: Large language models for explainable recommendation. arXiv preprint arXiv:2406.02377 (2024)
[21] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.t., Koh, P.W., Iyyer, M., Zettlemoyer, L., Hajishirzi, H.: Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251 (2023)
[22] Ni, J., Li, J., McAuley, J.: Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). pp. 188–197 (2019)
[23] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
[24] Raczyński, J., Lango, M., Stefanowski, J.: The problem of coherence in natural language explanations of recommendations. In: ECAI 2023, pp. 1922–1929. IOS Press (2023)
[25] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
[26] Scialom, T., Dray, P.A., Gallinari, P., Lamprier, S., Piwowarski, B., Staiano, J., Wang, A.: Questeval: Summarization asks for fact-based evaluation. arXiv preprint arXiv:2103.12693 (2021)
[27] Sellam, T., Das, D., Parikh, A.P.: Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696 (2020)
[28] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[30] Xie, Z., Singh, S., McAuley, J., Majumder, B.P.: Factual and informative review generation for explainable recommendation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 13816–13824 (2023)
[31] Yuan, W., Neubig, G., Liu, P.: Bartscore: Evaluating generated text as text generation. Advances in neural information processing systems 34, 27263–27277 (2021)
[32] Zha, Y., Yang, Y., Li, R., Hu, Z.: Alignscore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739 (2023)
[33] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)
[34] Zhang, Y., Chen, X., et al.: Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval 14(1), 1–101 (2020)
