arxiv: 2512.24366 · v2 · pith:6SRHLKGRnew · submitted 2025-12-30 · 💻 cs.IR

On the Factual Consistency of Text-based Explainable Recommendation Models

Pith reviewed 2026-05-21 15:41 UTC · model grok-4.3

classification 💻 cs.IR
keywords factual consistency · explainable recommendation · text-based explanations · LLM-based evaluation · Amazon reviews · statement-level metrics · BERTScore

The pith

Text-based explainable recommendation models achieve high semantic similarity to reviews yet show very low factual consistency with them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evaluation framework that uses large language models to pull atomic factual statements out of user reviews and then checks whether generated explanations actually match those statements. It applies the approach to five Amazon review categories and six current models, producing new benchmarks for fine-grained checks. The experiments find strong BERTScore results alongside statement-level precision scores between roughly 4 and 33 percent. This gap matters because explanations are intended to increase trust and transparency; when they stray from the evidence they can instead mislead users. The work supplies both the metrics and the test sets needed to measure and improve factual reliability in future systems.

Core claim

We introduce a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews and thereby construct a ground truth focused on factual content. We apply the pipeline to five categories from the Amazon Reviews dataset to create augmented benchmarks and propose statement-level alignment metrics that combine LLM and NLI methods. Experiments across six state-of-the-art explainable recommendation models show high semantic similarity (BERTScore F1 0.81-0.90) but low factual performance (LLM-based statement-level precision 4.38%-32.88%).
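
To make the statement-level metrics concrete, here is a minimal Python sketch of precision and recall over atomic statements. The definitions used (precision as the share of generated statements supported by some ground-truth statement, recall as the share of ground-truth statements matched by some generated statement) are an assumption for illustration; the paper's exact formulas are not reproduced on this page. The names `toy_judge` and `statement_precision_recall` are hypothetical, and the exact-match judge is only a stand-in for the LLM- and NLI-based checks the paper describes.

```python
from typing import Callable, List, Tuple

# A judge decides whether `candidate` is supported by `reference`.
# The paper uses LLM- and NLI-based checks for this role; the function
# below is a hypothetical stand-in that only does exact matching.
Judge = Callable[[str, str], bool]


def toy_judge(candidate: str, reference: str) -> bool:
    """Hypothetical judge: case-insensitive exact match after trimming."""
    return candidate.strip().lower() == reference.strip().lower()


def statement_precision_recall(
    generated: List[str],
    ground_truth: List[str],
    judge: Judge = toy_judge,
) -> Tuple[float, float]:
    """Return (precision, recall) over atomic statements.

    Assumed definitions (not taken from the paper):
      precision = share of generated statements supported by any
                  ground-truth statement
      recall    = share of ground-truth statements matched by any
                  generated statement
    """
    supported = sum(any(judge(g, t) for t in ground_truth) for g in generated)
    covered = sum(any(judge(g, t) for g in generated) for t in ground_truth)
    precision = supported / len(generated) if generated else 0.0
    recall = covered / len(ground_truth) if ground_truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up statements, purely for illustration.
    generated = ["Battery lasts long", "Screen is bright", "Great for travel"]
    ground_truth = ["battery lasts long", "Shipping was slow"]
    p, r = statement_precision_recall(generated, ground_truth)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy example one of three generated statements is supported and one of two ground-truth statements is covered. A fluent explanation can therefore score well on embedding similarity while most of its individual statements find no support in the review.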

What carries the argument

A prompting-based LLM pipeline that extracts atomic explanatory statements from reviews to build factual ground truth for alignment evaluation.

If this is right

  • Semantic similarity scores alone are insufficient for judging explanation quality in recommendation systems.
  • New training objectives or decoding strategies are needed to raise factual consistency without sacrificing fluency.
  • The created Amazon-derived benchmarks enable repeatable fine-grained testing of both consistency and relevance.
  • Factuality-aware metrics should be added to standard evaluation suites for text-based explainable recommenders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Persistent low factuality may gradually reduce user willingness to rely on explanations even when recommendations themselves remain accurate.
  • The same extraction-and-alignment approach could be adapted to check generated text in other high-stakes domains such as health or finance advice.
  • Models might benefit from explicit fact-verification modules that cross-check candidate explanations against the original review before output.

Load-bearing premise

The LLM pipeline that extracts atomic explanatory statements from reviews produces reliable and complete ground truth without its own errors or omissions.

What would settle it

Re-running the statement extraction step with several different large language models and checking whether the resulting factuality precision scores stay consistently low or change substantially.
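
A check of this kind could be summarized as below. This is a minimal sketch under stated assumptions: the extractor names and precision values are placeholders rather than results from the paper, and `summarize_across_extractors` and `stays_low` are hypothetical helper names.

```python
from statistics import mean, pstdev
from typing import Dict


def summarize_across_extractors(
    precision_by_extractor: Dict[str, float],
) -> Dict[str, float]:
    """Summarize one model's precision when the ground truth is rebuilt
    with different extraction LLMs (one score per extractor)."""
    values = list(precision_by_extractor.values())
    if not values:
        raise ValueError("need at least one extractor result")
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),  # population std over the extractors tried
        "spread": max(values) - min(values),
    }


def stays_low(summary: Dict[str, float], ceiling: float) -> bool:
    """True if even the most favorable extractor keeps precision under `ceiling`."""
    return summary["max"] <= ceiling


if __name__ == "__main__":
    # Placeholder scores, not measurements from the paper.
    scores = {"extractor_a": 0.10, "extractor_b": 0.14, "extractor_c": 0.12}
    summary = summarize_across_extractors(scores)
    print(summary, stays_low(summary, ceiling=0.33))
```

A small spread with a low maximum would suggest the finding does not hinge on one extraction model; a large spread would point to the ground truth itself as a source of variance.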

Figures

Figures reproduced from arXiv: 2512.24366 by Ben Kabongo, Vincent Guigue.

Figure 1: BERTScore F1 vs. LLM-based statement-level metrics Precision (top) and Recall (bottom). Pearson correlation (r) reported in brackets.
Adjacent text from the paper: Among various explanation paradigms, text-based explainable recommendation has emerged as a particularly promising approach, leveraging the flexibility and expressiveness of natural language to convey personalized rationales [3, 15–18, 20]. Recent advances have increasingl…
Figure 2: Text Similarity Results on explanation generation (top) and review gen…
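
Figure 1 reports a Pearson correlation (r) between BERTScore F1 and the statement-level metrics. As a minimal sketch of how such a coefficient is computed, the plain-Python function below implements the standard formula; `pearson_r` is a hypothetical helper name and the numbers in the usage lines are made up, not taken from the figure.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up per-model scores, for illustration only.
    bertscore_f1 = [0.81, 0.84, 0.86, 0.88, 0.90]
    precision = [0.20, 0.05, 0.30, 0.10, 0.15]
    print(round(pearson_r(bertscore_f1, precision), 3))
```

A coefficient near zero would mean that ranking models by BERTScore says little about their ranking by factual precision.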
original abstract

Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, to improve user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates and focuses on their factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations. Across extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a framework for assessing factual consistency in text-based explainable recommendation models. It employs an LLM prompting pipeline to extract atomic explanatory statements from reviews across five Amazon Reviews categories, forming ground-truth benchmarks, and evaluates six state-of-the-art models using combined LLM- and NLI-based statement-level alignment metrics. The central empirical finding is a pronounced gap: high semantic similarity (BERTScore F1 of 0.81–0.90) contrasted with low factual consistency (LLM-based statement-level precision of 4.38%–32.88%).

Significance. If the ground-truth construction and metrics prove reliable, the work usefully demonstrates that semantic overlap alone does not guarantee factual consistency in generated explanations, with direct implications for user trust in recommender systems. The augmented benchmarks and dual-metric approach constitute a concrete contribution to evaluation methodology in explainable recommendation.

major comments (2)
  1. [prompting-based pipeline and ground-truth construction] The central claim of a large semantic-versus-factual gap rests on the LLM-based extraction pipeline producing a reliable, complete ground truth of atomic factual statements. However, the manuscript reports no human validation, inter-annotator agreement, or error analysis for this extraction step (described in the prompting-based pipeline and benchmark construction). Without such checks, systematic omissions, hallucinations, or inconsistent criteria across categories could artifactually depress the reported precision scores.
  2. [statement-level alignment metrics] The statement-level alignment metrics (LLM- and NLI-based) are applied without reported controls for LLM judgment bias, prompt sensitivity, or inter-run variability. This is load-bearing because the low factuality numbers (4.38%–32.88%) are the primary evidence for the gap; any instability in the judge LLM would directly affect the strength of the conclusion.
minor comments (2)
  1. [Abstract] The abstract states results on 'five categories' but does not name them; listing the specific Amazon categories (e.g., Books, Electronics) would improve reproducibility.
  2. [Experimental results] Results tables would benefit from reporting the number of extracted statements per category and per model to allow readers to gauge the scale of the evaluation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that additional validation and robustness checks will strengthen the work and will incorporate them in the revision.

point-by-point responses
  1. Referee: [prompting-based pipeline and ground-truth construction] The central claim of a large semantic-versus-factual gap rests on the LLM-based extraction pipeline producing a reliable, complete ground truth of atomic factual statements. However, the manuscript reports no human validation, inter-annotator agreement, or error analysis for this extraction step (described in the prompting-based pipeline and benchmark construction). Without such checks, systematic omissions, hallucinations, or inconsistent criteria across categories could artifactually depress the reported precision scores.

    Authors: We acknowledge that the current manuscript does not report a formal human validation study, inter-annotator agreement, or systematic error analysis for the LLM-based extraction pipeline. Although the pipeline was iteratively refined with detailed prompts and we performed informal manual inspections of samples during development, these steps were not documented with quantitative agreement metrics. We agree this is a substantive limitation that could affect confidence in the ground-truth statements. In the revised manuscript we will add a dedicated subsection describing a human evaluation: two independent annotators will label a stratified random sample of 200 extracted statements (40 per category) for factual accuracy and completeness. We will report Cohen's kappa for inter-annotator agreement and include a qualitative error analysis of any observed omissions or inconsistencies. These additions will directly address the concern that the low precision scores might be artifacts of the extraction process. revision: yes

  2. Referee: [statement-level alignment metrics] The statement-level alignment metrics (LLM- and NLI-based) are applied without reported controls for LLM judgment bias, prompt sensitivity, or inter-run variability. This is load-bearing because the low factuality numbers (4.38%–32.88%) are the primary evidence for the gap; any instability in the judge LLM would directly affect the strength of the conclusion.

    Authors: We concur that the absence of explicit controls for judgment bias, prompt sensitivity, and run-to-run variability weakens the evidential strength of the reported factuality scores. While the NLI component provides a deterministic baseline, the LLM-based metric is indeed sensitive to prompt wording and model choice. To mitigate this, the revised version will include a new robustness subsection with three sets of experiments: (1) prompt sensitivity tests using three alternative prompt phrasings and reporting the range of precision scores; (2) inter-run variability measured over five independent runs with different random seeds, including standard deviations and confidence intervals; and (3) a cross-judge comparison using a second LLM (e.g., GPT-4o-mini) on a subset of the data. These results will be presented alongside the main metrics to demonstrate that the observed gap remains consistent despite moderate variability. revision: yes
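
The first response proposes reporting Cohen's kappa for two annotators. As a minimal sketch of that statistic, the function below computes kappa from two label sequences using the standard definition (observed agreement corrected for chance agreement). The labels are invented for illustration, and `cohens_kappa` is a hypothetical helper name rather than code from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label sequences of equal length")
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented labels: 1 = statement judged accurate, 0 = not accurate.
    annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    annotator_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.52, and kappa is about 0.583. Reporting kappa next to raw agreement matters because a skewed label distribution can make raw agreement look high by chance alone.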

Circularity Check

0 steps flagged

No circularity: empirical evaluation framework is self-contained

full rationale

The paper conducts an empirical evaluation of existing explainable recommendation models against a ground-truth constructed via an LLM prompting pipeline applied to an external public dataset (Amazon Reviews). No equations, fitted parameters, or self-referential definitions are present that would reduce the reported factuality scores (e.g., statement-level precision) to quantities defined inside the study itself. The semantic-vs-factual gap claim rests on independent comparisons using standard BERTScore, LLM-based, and NLI metrics rather than any derivation that collapses to the inputs by construction. Minor self-citations to prior recommendation work are not load-bearing for the central evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that LLM extraction yields faithful atomic facts and that the proposed alignment metrics correctly quantify factual consistency; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: LLMs can reliably extract atomic factual statements from product reviews without systematic omission or hallucination
    This premise underpins the ground-truth construction step described in the abstract.

pith-pipeline@v0.9.0 · 5748 in / 1280 out tokens · 33760 ms · 2026-05-21T15:41:05.455948+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

    cs.IR · 2026-04 · unverdicted · novelty 7.0

    The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  [2] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)

  [3] Dong, L., Huang, S., Wei, F., Lapata, M., Zhou, M., Xu, K.: Learning to generate product reviews from attributes. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pp. 623–632 (2017)

  [4] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)

  [5] Fu, J., Ng, S.K., Jiang, Z., Liu, P.: GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166 (2023)

  [6] He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: Decoding-enhanced BERT with disentangled attention. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=XPZIaotutsD

  [7] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., Wang, M.: LightGCN: Simplifying and powering graph convolution network for recommendation. In: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. pp. 639–648 (2020)

  [8] Herserant, T., Guigue, V.: Seval-ex: A statement-level framework for explainable summarization evaluation. arXiv preprint arXiv:2505.02235 (2025)

  [9] Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kukliansy, D., Cohen, V., Scialom, T., Szpektor, I., Hassidim, A., Matias, Y.: TRUE: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991 (2022)

  [10] Huang, Y., Sun, L., Wang, H., Wu, S., Zhang, Q., Li, Y., Gao, C., Huang, Y., Lyu, W., Zhang, Y., et al.: TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024)

  [11] Kabongo, B., Guigue, V., Lemberger, P.: Elixir: Efficient and lightweight model for explaining recommendations. arXiv preprint arXiv:2508.20312 (2025)

  [12] Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)

  [13] Laban, P., Schnabel, T., Bennett, P.N., Hearst, M.A.: SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics 10, 163–177 (2022)

  [14] Li, L., Zhang, Y., Chen, L.: Generate neural template explanations for recommendation. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. pp. 755–764 (2020)

  [15] Li, L., Zhang, Y., Chen, L.: Personalized transformer for explainable recommendation. arXiv preprint arXiv:2105.11601 (2021)

  [16] Li, L., Zhang, Y., Chen, L.: Personalized prompt learning for explainable recommendation. ACM Transactions on Information Systems 41(4), 1–26 (2023)

  [17] Li, P., Wang, Z., Ren, Z., Bing, L., Lam, W.: Neural rating regression with abstractive tips generation for recommendation. In: Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. pp. 345–354 (2017)

  [18] Li, Y., Zhang, X., Luo, L., Chang, H., Ren, Y., King, I., Li, J.: G-refer: Graph retrieval-augmented large language model for explainable recommendation. In: Proceedings of the ACM on Web Conference 2025. pp. 240–251 (2025)

  [19] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)

  [20] Ma, Q., Ren, X., Huang, C.: Xrec: Large language models for explainable recommendation. arXiv preprint arXiv:2406.02377 (2024)

  [21] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.t., Koh, P.W., Iyyer, M., Zettlemoyer, L., Hajishirzi, H.: FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251 (2023)

  [22] Ni, J., Li, J., McAuley, J.: Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). pp. 188–197 (2019)

  [23] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

  [24] Raczyński, J., Lango, M., Stefanowski, J.: The problem of coherence in natural language explanations of recommendations. In: ECAI 2023, pp. 1922–1929. IOS Press (2023)

  [25] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)

  [26] Scialom, T., Dray, P.A., Gallinari, P., Lamprier, S., Piwowarski, B., Staiano, J., Wang, A.: QuestEval: Summarization asks for fact-based evaluation. arXiv preprint arXiv:2103.12693 (2021)

  [27] Sellam, T., Das, D., Parikh, A.P.: BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696 (2020)

  [28] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  [29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

  [30] Xie, Z., Singh, S., McAuley, J., Majumder, B.P.: Factual and informative review generation for explainable recommendation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 13816–13824 (2023)

  [31] Yuan, W., Neubig, G., Liu, P.: BARTScore: Evaluating generated text as text generation. Advances in neural information processing systems 34, 27263–27277 (2021)

  [32] Zha, Y., Yang, Y., Li, R., Hu, Z.: AlignScore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739 (2023)

  [33] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)

  [34] Zhang, Y., Chen, X., et al.: Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval 14(1), 1–101 (2020)