On the Factual Consistency of Text-based Explainable Recommendation Models
Pith reviewed 2026-05-21 15:41 UTC · model grok-4.3
The pith
Text-based explainable recommendation models achieve high semantic similarity to reviews yet show very low factual consistency with them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews and thereby construct a ground truth focused on factual content. We apply the pipeline to five categories from the Amazon Reviews dataset to create augmented benchmarks and propose statement-level alignment metrics that combine LLM and NLI methods. Experiments across six state-of-the-art explainable recommendation models show high semantic similarity (BERTScore F1 0.81-0.90) but low factual performance (LLM-based statement-level precision 4.38%-32.88%).
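To make the headline number concrete, the sketch below shows one plausible reading of statement-level precision: the share of generated statements that are supported by at least one ground-truth statement. The paper's exact definition and judge are not given here, so the function names and the token-overlap judge are assumptions; the overlap judge is only a toy stand-in for an LLM or NLI model.

```python
from typing import Callable, Sequence


def token_overlap_judge(candidate: str, reference: str, threshold: float = 0.6) -> bool:
    """Toy stand-in for an LLM or NLI judge: Jaccard overlap of lowercase tokens."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def statement_precision(
    generated: Sequence[str],
    ground_truth: Sequence[str],
    supports: Callable[[str, str], bool] = token_overlap_judge,
) -> float:
    """Fraction of generated statements supported by at least one ground-truth statement."""
    if not generated:
        return 0.0
    hits = sum(any(supports(g, t) for t in ground_truth) for g in generated)
    return hits / len(generated)


# Illustrative statements only; they are not taken from the benchmark.
truth = ["the battery lasts two days", "the screen is bright"]
output = ["the battery lasts two days", "the camera is excellent", "shipping was fast"]
print(statement_precision(output, truth))
```

Passing a different `supports` callable is how an actual LLM or NLI judge would be plugged in; an explanation can read fluently and still score low here if most of its statements have no support in the review.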
What carries the argument
A prompting-based LLM pipeline that extracts atomic explanatory statements from reviews to build factual ground truth for alignment evaluation.
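As a rough illustration of what such an extraction step might look like, the sketch below fills a prompt, sends it to any text-in, text-out model, and parses bullet lines into atomic statements. The prompt wording, the function names, and the canned `fake_llm` reply are all hypothetical; the paper's actual prompts and parsing rules are not reproduced here.

```python
from typing import Callable, List

# Hypothetical prompt; the paper's actual prompt wording is not reproduced here.
PROMPT_TEMPLATE = (
    "List the atomic explanatory statements in the review below, "
    "one per line, each prefixed with '- '.\n\nReview: {review}"
)


def parse_statements(raw: str) -> List[str]:
    """Keep '- ' bullet lines, strip the marker, and drop exact duplicates in order."""
    seen, out = set(), []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip preambles and any non-bullet chatter
        text = line[2:].strip()
        if text and text not in seen:
            seen.add(text)
            out.append(text)
    return out


def extract_statements(review: str, llm: Callable[[str], str]) -> List[str]:
    """Send the filled-in prompt to the model and parse its reply."""
    return parse_statements(llm(PROMPT_TEMPLATE.format(review=review)))


def fake_llm(prompt: str) -> str:
    # Canned reply standing in for a real model call.
    return (
        "Here are the statements:\n"
        "- The battery lasts two days\n"
        "- The screen is bright\n"
        "- The screen is bright\n"
    )


statements = extract_statements(
    "Great phone, battery lasts two days and the screen is bright.", fake_llm
)
print(statements)
```

Keeping the model behind a plain callable makes it straightforward to swap extractors, which matters for the premise discussed below.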
If this is right
- Semantic similarity scores alone are insufficient for judging explanation quality in recommendation systems.
- New training objectives or decoding strategies are needed to raise factual consistency without sacrificing fluency.
- The created Amazon-derived benchmarks enable repeatable fine-grained testing of both consistency and relevance.
- Factuality-aware metrics should be added to standard evaluation suites for text-based explainable recommenders.
Where Pith is reading between the lines
- Persistent low factuality may gradually reduce user willingness to rely on explanations even when recommendations themselves remain accurate.
- The same extraction-and-alignment approach could be adapted to check generated text in other high-stakes domains such as health or finance advice.
- Models might benefit from explicit fact-verification modules that cross-check candidate explanations against the original review before output.
Load-bearing premise
The LLM pipeline that extracts atomic explanatory statements from reviews produces reliable and complete ground truth without its own errors or omissions.
What would settle it
Re-running the statement extraction step with several different large language models and checking whether the resulting factuality precision scores stay consistently low or change substantially.
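A minimal way to read the outcome of such a re-run is to summarise how far the precision score moves as the extraction model changes. The helper below is a sketch under that assumption; the extractor names and the numbers are invented for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict


def precision_spread(scores_by_extractor: Dict[str, float]) -> Dict[str, float]:
    """Summarise how much a precision score moves when the extraction LLM changes."""
    values = list(scores_by_extractor.values())
    if not values:
        raise ValueError("need at least one extractor")
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "range": max(values) - min(values),
    }


# Made-up numbers for illustration only; they are not results from the paper.
example = {"extractor_a": 0.12, "extractor_b": 0.15, "extractor_c": 0.31}
summary = precision_spread(example)
print(summary)
```

A small range across extractors would suggest the low scores are not an artifact of one model's extraction habits; a large range would point back at the ground truth.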
Original abstract
Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, to improve user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates and focuses on their factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations. Across extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a framework for assessing factual consistency in text-based explainable recommendation models. It employs an LLM prompting pipeline to extract atomic explanatory statements from reviews across five Amazon Reviews categories, forming ground-truth benchmarks, and evaluates six state-of-the-art models using combined LLM- and NLI-based statement-level alignment metrics. The central empirical finding is a pronounced gap: high semantic similarity (BERTScore F1 of 0.81–0.90) contrasted with low factual consistency (LLM-based statement-level precision of 4.38%–32.88%).
Significance. If the ground-truth construction and metrics prove reliable, the work usefully demonstrates that semantic overlap alone does not guarantee factual consistency in generated explanations, with direct implications for user trust in recommender systems. The augmented benchmarks and dual-metric approach constitute a concrete contribution to evaluation methodology in explainable recommendation.
major comments (2)
- [prompting-based pipeline and ground-truth construction] The central claim of a large semantic-versus-factual gap rests on the LLM-based extraction pipeline producing a reliable, complete ground truth of atomic factual statements. However, the manuscript reports no human validation, inter-annotator agreement, or error analysis for this extraction step (described in the prompting-based pipeline and benchmark construction). Without such checks, systematic omissions, hallucinations, or inconsistent criteria across categories could artifactually depress the reported precision scores.
- [statement-level alignment metrics] The statement-level alignment metrics (LLM- and NLI-based) are applied without reported controls for LLM judgment bias, prompt sensitivity, or inter-run variability. This is load-bearing because the low factuality numbers (4.38%–32.88%) are the primary evidence for the gap; any instability in the judge LLM would directly affect the strength of the conclusion.
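The inter-run variability raised in the second comment can be quantified with a few lines of standard statistics. The sketch below is an assumption about how such a check might be reported, not the authors' procedure: it takes one precision score per judge run and returns the mean, the sample standard deviation, and an approximate 95% interval. The run scores are invented.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def run_variability(precisions: Sequence[float], critical: float = 1.96) -> Dict[str, float]:
    """Mean, sample standard deviation, and an approximate confidence interval.

    `critical` defaults to the normal value 1.96; with only a handful of runs a
    Student-t value (2.776 for five runs) gives a wider, more honest interval.
    """
    if len(precisions) < 2:
        raise ValueError("need at least two runs")
    m = mean(precisions)
    s = stdev(precisions)
    half_width = critical * s / sqrt(len(precisions))
    return {"mean": m, "std": s, "ci_low": m - half_width, "ci_high": m + half_width}


# Invented scores from five hypothetical judge runs.
runs = [0.10, 0.12, 0.11, 0.13, 0.09]
stats = run_variability(runs, critical=2.776)
print(stats)
```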
minor comments (2)
- [Abstract] The abstract states results on 'five categories' but does not name them; listing the specific Amazon categories (e.g., Books, Electronics) would improve reproducibility.
- [Experimental results] Results tables would benefit from reporting the number of extracted statements per category and per model to allow readers to gauge the scale of the evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that additional validation and robustness checks will strengthen the work and will incorporate them in the revision.
- Referee: [prompting-based pipeline and ground-truth construction] The central claim of a large semantic-versus-factual gap rests on the LLM-based extraction pipeline producing a reliable, complete ground truth of atomic factual statements. However, the manuscript reports no human validation, inter-annotator agreement, or error analysis for this extraction step (described in the prompting-based pipeline and benchmark construction). Without such checks, systematic omissions, hallucinations, or inconsistent criteria across categories could artifactually depress the reported precision scores.
  Authors: We acknowledge that the current manuscript does not report a formal human validation study, inter-annotator agreement, or systematic error analysis for the LLM-based extraction pipeline. Although the pipeline was iteratively refined with detailed prompts and we performed informal manual inspections of samples during development, these steps were not documented with quantitative agreement metrics. We agree this is a substantive limitation that could affect confidence in the ground-truth statements. In the revised manuscript we will add a dedicated subsection describing a human evaluation: two independent annotators will label a stratified random sample of 200 extracted statements (40 per category) for factual accuracy and completeness. We will report Cohen's kappa for inter-annotator agreement and include a qualitative error analysis of any observed omissions or inconsistencies. These additions will directly address the concern that the low precision scores might be artifacts of the extraction process.
  Revision: yes
- Referee: [statement-level alignment metrics] The statement-level alignment metrics (LLM- and NLI-based) are applied without reported controls for LLM judgment bias, prompt sensitivity, or inter-run variability. This is load-bearing because the low factuality numbers (4.38%–32.88%) are the primary evidence for the gap; any instability in the judge LLM would directly affect the strength of the conclusion.
  Authors: We concur that the absence of explicit controls for judgment bias, prompt sensitivity, and run-to-run variability weakens the evidential strength of the reported factuality scores. While the NLI component provides a deterministic baseline, the LLM-based metric is indeed sensitive to prompt wording and model choice. To mitigate this, the revised version will include a new robustness subsection with three sets of experiments: (1) prompt sensitivity tests using three alternative prompt phrasings and reporting the range of precision scores; (2) inter-run variability measured over five independent runs with different random seeds, including standard deviations and confidence intervals; and (3) a cross-judge comparison using a second LLM (e.g., GPT-4o-mini) on a subset of the data. These results will be presented alongside the main metrics to demonstrate that the observed gap remains consistent despite moderate variability.
  Revision: yes
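The human evaluation proposed in the first response reports Cohen's kappa for two annotators. For reference, kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it in plain Python; the function name and the ten example labels are illustrative assumptions, not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels: 1 = statement judged accurate, 0 = judged inaccurate.
annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 4))
```

In this toy example the annotators agree on 8 of 10 items, chance agreement is 0.52, and kappa comes out near 0.58, which shows how raw agreement can overstate reliability when one label dominates.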
Circularity Check
No circularity: empirical evaluation framework is self-contained
The paper conducts an empirical evaluation of existing explainable recommendation models against a ground-truth constructed via an LLM prompting pipeline applied to an external public dataset (Amazon Reviews). No equations, fitted parameters, or self-referential definitions are present that would reduce the reported factuality scores (e.g., statement-level precision) to quantities defined inside the study itself. The semantic-vs-factual gap claim rests on independent comparisons using standard BERTScore, LLM-based, and NLI metrics rather than any derivation that collapses to the inputs by construction. Minor self-citations to prior recommendation work are not load-bearing for the central evaluation results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can reliably extract atomic factual statements from product reviews without systematic omission or hallucination
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews... statement-level alignment metrics that combine LLM- and NLI-based approaches"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
  The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.