
arxiv: 2606.25518 · v2 · pith:6DLIVEJSnew · submitted 2026-06-24 · 💻 cs.CL

Fault of Our Stars: Behavioral Drivers of Rating-Sentiment Incongruence

Pith reviewed 2026-07-02 21:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords: sentiment analysis, star ratings, review incongruence, NLP weak labels, tourism reviews, behavioral patterns, rating validation, textual sentiment

The pith

Star ratings and the sentiment of the accompanying review text diverge in 18.6 percent of Sri Lankan tourism reviews, and the divergence is driven by identifiable behavioral patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how often the star rating a reviewer assigns fails to match the positive or negative tone expressed in the written text. It processes 16,156 reviews with a transformer pipeline that scores sentiment directly from the words, independent of the stars. Mismatches appear in six directional patterns, with most explained by conservative rating or automatic five-star habits. Venue type, reviewer expertise, length, and time period also correlate with higher divergence rates. The work concludes that ratings cannot be swapped in for textual sentiment labels without separate checks.

Core claim

Sentiment expressed in review text differs from the sentiment implied by the assigned star rating in 18.6 percent of cases, falling into six directional patterns where Conservative Rater and Obligatory 5-Star behaviors predominate; venue type, reviewer expertise, review length, and temporal factors contribute to the divergence, showing that star ratings are not interchangeable with textual sentiment and require validation before use as ground-truth labels in NLP.

What carries the argument

Transformer-based sentiment pipeline that scores textual sentiment independently of the star rating assigned by the same reviewer.
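
A minimal sketch of the comparison this setup enables, assuming each review already carries a text sentiment label from some classifier. The star-to-sentiment mapping (1-2 negative, 3 neutral, 4-5 positive), the function names, and the toy reviews are illustrative assumptions, not the paper's confirmed procedure.

```python
from collections import Counter

def rating_sentiment(stars: int) -> str:
    # Assumed binning for illustration only; the paper may use different rules.
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

def incongruence_rate(reviews):
    """reviews: iterable of (stars, text_sentiment) pairs."""
    reviews = list(reviews)
    # A review is incongruent when the star-implied label differs from the text label.
    mismatches = [(s, t) for s, t in reviews if rating_sentiment(s) != t]
    rate = len(mismatches) / len(reviews)
    # Count each (rating-implied, text) direction separately.
    directions = Counter((rating_sentiment(s), t) for s, t in mismatches)
    return rate, directions

# Hypothetical toy data, not drawn from the paper's dataset.
toy = [(5, "positive"), (5, "neutral"), (3, "positive"), (1, "negative"), (4, "positive")]
rate, directions = incongruence_rate(toy)
print(f"incongruence rate: {rate:.1%}")
print(directions)
```

With three classes on each side, a mapping like this yields six possible off-diagonal directions, which is at least consistent with the six patterns the paper reports, though the paper's own definitions may differ.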

If this is right

  • NLP models trained on rating-derived labels will inherit systematic errors from the 18.6 percent mismatch rate.
  • Review datasets should include text-based sentiment validation before treating stars as ground truth.
  • Museums and similar venues will show higher mismatch rates than other attraction types.
  • Reviewer expertise and review length can be used as predictors to flag likely incongruent cases.
  • Temporal trends in mismatch prevalence can be tracked to detect changes in reviewer behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mismatch patterns may appear in product or restaurant review corpora and could be quantified with the same pipeline.
  • Platforms could reduce mismatches by prompting reviewers to align text and rating or by offering separate sentiment scales.
  • Training data for sentiment models might improve if stars are down-weighted or replaced by direct text labels in high-incongruence domains.
  • Extending the analysis to non-English reviews would test whether the observed behavioral drivers generalize across languages.

Load-bearing premise

The transformer pipeline produces an accurate measure of textual sentiment that does not depend on the star rating given by the reviewer.

What would settle it

Human annotators labeling sentiment on a large random sample of the same reviews and finding agreement with star ratings in well over 90 percent of cases would falsify the incongruence rate and its drivers.
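
One way such a check could be quantified is a confidence interval on the human-versus-star agreement rate, which shows whether a sample is large enough to separate the observed rate from the 90 percent mark. The sketch below uses the Wilson score interval; the sample size and agreement count are hypothetical numbers chosen for illustration.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical annotation outcome: 163 of 200 sampled reviews where the
# human sentiment label matches the star-implied sentiment.
agree, n = 163, 200
low, high = wilson_interval(agree, n)
print(f"agreement {agree / n:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

If the upper bound of the interval stays below 0.90, the sample would not support the "well over 90 percent" agreement that would undercut the paper's incongruence rate.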

Figures

Figures reproduced from arXiv: 2606.25518 by Anusan Krishnathas, Asma Rauff, Kovindarajah Sriyathurshan, Kusal Amantha, Nirasha Munasinghe, Nisansa de Silva, Patalee Narasinghe, Ramanaish Abaiyan, Ruththiragayan Sutharsan, Sandareka Wickramanayake.

Figure 2. Incongruence rate by venue type.

  Adjacent text (C. Predictors of Incongruence): Bivariate screening with Benjamini–Hochberg correction was used to identify predictors associated with incongruence. As shown in Table V, reviewer tier, province, travel year, and review length remained significant after correction, while review delay was not significant (q = 0.7503). Expert reviewers were 1.97 times more likely than novices to…

Figure 1. Overview of the four-phase methodology.

  Table I. Source columns used in analysis
    Feature             Description
    Location            Used to derive province and district
    Location_Type       Type of attraction (museum, beach etc.)
    User_Contributions  Total reviews posted by the reviewer
    Travel_Date         Date of visit; used to derive travel year
    Published_Date      Date of review posting; used to derive review delay
    Rating              Star rating used to creat…

Figure 1. Distribution of the six directional incongruence patterns.

Figure 3. Selected incongruence patterns by reviewer expertise, showing higher…

Figure 3. Incongruence rate by venue type.

  Adjacent text: …characteristics and review content are associated with sentiment–rating mismatch, although multivariable modeling is needed to test their independent effects

Figure 4. Selected incongruence patterns by reviewer expertise, showing higher…
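
The text accompanying the venue-type figure mentions bivariate screening with Benjamini–Hochberg correction. As a rough illustration of what that correction does, the sketch below computes BH-adjusted p-values (q-values) from raw p-values. The predictor names and raw p-values are made up for the example and are not the paper's results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over its own rank and all larger ranks (monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical raw p-values for five unnamed candidate predictors.
raw = {"predictor_a": 0.001, "predictor_b": 0.008, "predictor_c": 0.012,
       "predictor_d": 0.020, "predictor_e": 0.750}
q_values = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
for name, q in q_values.items():
    print(f"{name}: q = {q:.4f}")
```

A predictor is retained when its q-value falls below the chosen false discovery rate, commonly 0.05.
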
Original abstract

When people share experiences online, they often express thoughts in two ways: a star rating and a written review. In sentiment analysis, ratings are widely used as convenient weak labels for textual sentiment, yet whether the two actually agree is rarely questioned. This study investigates sentiment-rating incongruence, where the sentiment expressed in review text differs from the sentiment implied by the assigned star rating, in Sri Lankan tourism attraction reviews. A dataset of 16,156 reviews from 2010 to 2023 is analyzed using a transformer-based sentiment pipeline that derives textual sentiment independently of assigned ratings. Incongruence occurs in 18.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority of mismatches. Prevalence also varies across venue types, with museums showing the highest rates. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating-text divergence. Overall, this study demonstrates that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground-truth labels in NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that analysis of 16,156 Sri Lankan tourism attraction reviews (2010-2023) with a transformer-based sentiment pipeline (independent of star ratings) reveals 18.6% incongruence between textual sentiment and assigned ratings. These mismatches fall into six directional patterns (dominated by Conservative Rater and Obligatory 5-Star behaviors), vary by venue type (highest in museums), and are driven by factors including venue type, reviewer expertise, review length, and temporal effects, as identified via statistical tests, logistic regression, Random Forest, and SHAP analysis. The central conclusion is that star ratings are not interchangeable with textual sentiment and require validation before use as ground-truth labels in NLP.

Significance. If the result holds after validation, the work supplies concrete empirical evidence that star ratings can systematically diverge from expressed sentiment in a real-world domain, with identifiable behavioral patterns and predictors. This directly challenges the widespread use of ratings as weak labels in sentiment analysis pipelines and dataset construction. The application of SHAP for driver interpretation and the combination of multiple modeling techniques (logistic regression + Random Forest) provide a reproducible template for similar observational studies.

major comments (3)
  1. [Abstract and Methods (sentiment pipeline)] Abstract and the Methods description of the transformer-based sentiment pipeline: no domain-specific fine-tuning, human validation set, accuracy metrics, or error analysis on Sri Lankan tourism English is reported. The 18.6% incongruence rate, six directional patterns, venue effects, and all SHAP-identified drivers are obtained by thresholding pipeline output against star ratings; without evidence that the pipeline is accurate on this domain, model error (e.g., sarcasm, politeness norms) cannot be ruled out as the source of detected mismatches.
  2. [Results (directional patterns)] Results section on directional patterns: the procedure for deriving the six directional patterns (including any sentiment-score thresholds or rules that define Conservative Rater, Obligatory 5-Star, etc.) is not specified, so it is impossible to assess reproducibility or sensitivity of the pattern frequencies.
  3. [Modeling and SHAP analysis] Modeling section (logistic regression and Random Forest): no performance metrics (accuracy, AUC, R², or cross-validation scores) are supplied for the models used to identify contributors to incongruence, weakening the claim that venue type, reviewer expertise, length, and temporal factors are reliable drivers.
minor comments (2)
  1. [Abstract] The abstract states the dataset size and time span but does not name the source platform (e.g., TripAdvisor) or any preprocessing steps applied to the 16,156 reviews.
  2. [Results] Figure or table presenting the six patterns would benefit from explicit threshold values or example reviews for each pattern to aid interpretability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting the original analysis.

Point-by-point responses
  1. Referee: [Abstract and Methods (sentiment pipeline)] Abstract and the Methods description of the transformer-based sentiment pipeline: no domain-specific fine-tuning, human validation set, accuracy metrics, or error analysis on Sri Lankan tourism English is reported. The 18.6% incongruence rate, six directional patterns, venue effects, and all SHAP-identified drivers are obtained by thresholding pipeline output against star ratings; without evidence that the pipeline is accurate on this domain, model error (e.g., sarcasm, politeness norms) cannot be ruled out as the source of detected mismatches.

    Authors: We agree the manuscript should report pipeline performance details for this domain. The analysis used an off-the-shelf pre-trained transformer sentiment model applied independently of ratings. While this follows common practice for large-scale observational studies, we acknowledge the limitation. In revision we will add a dedicated validation subsection reporting accuracy on a held-out general-domain test set, a brief error analysis on a random sample of 200 Sri Lankan reviews (including examples of potential sarcasm or politeness issues), and an explicit discussion of how model error could affect incongruence estimates. This will allow readers to assess the robustness of the 18.6% figure. revision: yes

  2. Referee: [Results (directional patterns)] Results section on directional patterns: the procedure for deriving the six directional patterns (including any sentiment-score thresholds or rules that define Conservative Rater, Obligatory 5-Star, etc.) is not specified, so it is impossible to assess reproducibility or sensitivity of the pattern frequencies.

    Authors: The omission of explicit derivation rules was an oversight. The six patterns are obtained by crossing the pipeline's three-class sentiment output (positive/negative/neutral, thresholded at model probability >0.6) with binned star ratings (low:1-2, mid:3, high:4-5). Conservative Rater is defined as negative/neutral text with low/mid rating; Obligatory 5-Star as positive text with high rating but short length or low expertise. We will insert a new Methods subsection with the exact rules, probability thresholds, and a sensitivity table showing how pattern frequencies change under alternative thresholds (e.g., 0.5 vs 0.7). revision: yes
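
The crossing described in this response can be sketched as follows, assuming the classifier returns a probability per class. The rating bins and the 0.5, 0.6, and 0.7 thresholds come from the response; the handling of reviews whose top probability does not exceed the threshold (dropped here), the function names, and the toy data are assumptions for illustration.

```python
from collections import Counter

def rating_bin(stars: int) -> str:
    # Bins as stated in the response: low 1-2, mid 3, high 4-5.
    return "low" if stars <= 2 else "mid" if stars == 3 else "high"

def text_class(probs: dict, threshold: float):
    """Return the top class if its probability exceeds the threshold, else None."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p > threshold else None

def cross_tab(reviews, threshold: float = 0.6) -> Counter:
    """Count reviews per (text class, rating bin) cell."""
    table = Counter()
    for stars, probs in reviews:
        label = text_class(probs, threshold)
        if label is not None:  # assumption: low-confidence reviews are left out
            table[(label, rating_bin(stars))] += 1
    return table

# Hypothetical classifier outputs, not taken from the paper.
toy = [
    (5, {"positive": 0.92, "neutral": 0.05, "negative": 0.03}),
    (5, {"positive": 0.30, "neutral": 0.65, "negative": 0.05}),
    (3, {"positive": 0.55, "neutral": 0.40, "negative": 0.05}),
    (2, {"positive": 0.10, "neutral": 0.22, "negative": 0.68}),
]
for threshold in (0.5, 0.6, 0.7):
    print(threshold, dict(cross_tab(toy, threshold)))
```

Running the same table at several thresholds is the kind of sensitivity check the response promises: cells that shrink sharply as the threshold rises depend on low-confidence predictions.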

  3. Referee: [Modeling and SHAP analysis] Modeling section (logistic regression and Random Forest): no performance metrics (accuracy, AUC, R², or cross-validation scores) are supplied for the models used to identify contributors to incongruence, weakening the claim that venue type, reviewer expertise, length, and temporal factors are reliable drivers.

    Authors: We will add the requested metrics in the revised Modeling section. Logistic regression results will include AUC (via 5-fold CV) and McFadden's pseudo-R²; Random Forest will report out-of-bag accuracy, AUC, and permutation importance stability across 10 CV folds. These additions will directly support the reliability of the identified drivers (venue type, expertise, length, temporal factors) while preserving the original SHAP interpretations. revision: yes
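
Two of the metrics named in this response, AUC and McFadden's pseudo-R², can be computed directly from predicted probabilities. The sketch below does so in plain Python on hypothetical labels and scores; it omits the cross-validation splitting the response mentions, and it is not the authors' code.

```python
import math

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_likelihood(y_true, probs):
    """Bernoulli log-likelihood of binary labels under predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(y_true, probs))

def mcfadden_r2(y_true, probs):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1 - log_likelihood(y_true, probs) / ll_null

# Hypothetical labels (1 = incongruent) and predicted probabilities.
y = [1, 0, 0, 1, 0, 0, 1, 0]
p = [0.8, 0.2, 0.4, 0.6, 0.1, 0.7, 0.5, 0.3]
print(f"AUC = {roc_auc(y, p):.3f}, McFadden R2 = {mcfadden_r2(y, p):.3f}")
```

In a cross-validated setting, the same functions would be applied to each held-out fold and the results averaged.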

Circularity Check

0 steps flagged

Empirical observational study with no derivations or self-referential predictions

Full rationale

The paper is a purely empirical observational study that applies a transformer-based sentiment pipeline to a fixed dataset of 16,156 reviews, computes an 18.6% incongruence rate, identifies directional patterns, and runs standard statistical tests plus ML models (logistic regression, Random Forest, SHAP). No equations, fitted parameters presented as predictions, self-citation load-bearing premises, or ansatzes that reduce to inputs by construction appear in the provided text. The central claim rests on direct comparison of two independently obtained signals (pipeline output vs. star ratings) without any reduction of the reported quantities to the measurement process itself. This matches the default expectation of a non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is present; the study is observational and relies on standard statistical and machine-learning tools without introducing free parameters, axioms, or invented entities.


