Fault of Our Stars: Behavioral Drivers of Rating-Sentiment Incongruence
Pith reviewed 2026-07-02 21:21 UTC · model grok-4.3
The pith
Star ratings and the sentiment of the accompanying review text diverge in 18.6 percent of Sri Lankan tourism reviews, and the divergence follows identifiable behavioral patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The sentiment expressed in review text differs from the sentiment implied by the assigned star rating in 18.6 percent of reviews. The mismatches fall into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority. Venue type, reviewer expertise, review length, and temporal factors contribute to the divergence. Star ratings are therefore not interchangeable with textual sentiment and should be validated before use as ground-truth labels in NLP.
What carries the argument
A transformer-based sentiment pipeline that scores textual sentiment independently of the star rating assigned by the same reviewer.
If this is right
- NLP models trained on rating-derived labels will inherit systematic errors from the 18.6 percent mismatch rate.
- Review datasets should include text-based sentiment validation before treating stars as ground truth.
- Museums and similar venues will show higher mismatch rates than other attraction types.
- Reviewer expertise and review length can be used as predictors to flag likely incongruent cases.
- Temporal trends in mismatch prevalence can be tracked to detect changes in reviewer behavior.
Where Pith is reading between the lines
- The same mismatch patterns may appear in product or restaurant review corpora and could be quantified with the same pipeline.
- Platforms could reduce mismatches by prompting reviewers to align text and rating or by offering separate sentiment scales.
- Training data for sentiment models might improve if stars are down-weighted or replaced by direct text labels in high-incongruence domains.
- Extending the analysis to non-English reviews would test whether the observed behavioral drivers generalize across languages.
Load-bearing premise
The transformer pipeline produces an accurate measure of textual sentiment that does not depend on the star rating given by the reviewer.
What would settle it
Human annotators labeling sentiment on a large random sample of the same reviews and finding agreement with star ratings in well over 90 percent of cases would falsify the incongruence rate and its drivers.
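The check described above can be sketched as a small calculation: count how often human sentiment labels agree with the sentiment implied by the stars, then put a confidence interval around that rate. This is a minimal sketch, not the paper's procedure. The function names, the 0.90 threshold default, the use of a Wilson score interval, and the label strings are all assumptions made for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def agreement_check(human_labels, star_labels, threshold=0.90):
    """Compare human sentiment labels with star-implied labels.

    Both inputs are equal-length sequences of hypothetical label strings.
    'exceeds_threshold' is True only if the whole interval lies above
    the threshold, which is the conservative reading of "well over".
    """
    if len(human_labels) != len(star_labels):
        raise ValueError("label sequences must have the same length")
    n = len(human_labels)
    agree = sum(h == s for h, s in zip(human_labels, star_labels))
    low, high = wilson_interval(agree, n)
    return {"rate": agree / n, "ci": (low, high), "exceeds_threshold": low > threshold}

# Toy illustration with made-up labels, not data from the paper.
toy = agreement_check(["pos", "pos", "pos", "pos", "neg"], ["pos"] * 5)
```

With a small sample the interval is wide, so a point estimate above 90 percent would not by itself settle the question; the lower bound is the number to watch.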
Original abstract
When people share experiences online, they often express thoughts in two ways: a star rating and a written review. In sentiment analysis, ratings are widely used as convenient weak labels for textual sentiment, yet whether the two actually agree is rarely questioned. This study investigates sentiment-rating incongruence, where the sentiment expressed in review text differs from the sentiment implied by the assigned star rating, in Sri Lankan tourism attraction reviews. A dataset of 16,156 reviews from 2010 to 2023 is analyzed using a transformer-based sentiment pipeline that derives textual sentiment independently of assigned ratings. Incongruence occurs in 18.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority of mismatches. Prevalence also varies across venue types, with museums showing the highest rates. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating-text divergence. Overall, this study demonstrates that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground-truth labels in NLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that analysis of 16,156 Sri Lankan tourism attraction reviews (2010-2023) with a transformer-based sentiment pipeline (independent of star ratings) reveals 18.6% incongruence between textual sentiment and assigned ratings. These mismatches fall into six directional patterns (dominated by Conservative Rater and Obligatory 5-Star behaviors), vary by venue type (highest in museums), and are driven by factors including venue type, reviewer expertise, review length, and temporal effects, as identified via statistical tests, logistic regression, Random Forest, and SHAP analysis. The central conclusion is that star ratings are not interchangeable with textual sentiment and require validation before use as ground-truth labels in NLP.
Significance. If the result holds after validation, the work supplies concrete empirical evidence that star ratings can systematically diverge from expressed sentiment in a real-world domain, with identifiable behavioral patterns and predictors. This directly challenges the widespread use of ratings as weak labels in sentiment analysis pipelines and dataset construction. The application of SHAP for driver interpretation and the combination of multiple modeling techniques (logistic regression + Random Forest) provide a reproducible template for similar observational studies.
major comments (3)
- [Abstract and Methods (sentiment pipeline)] Abstract and the Methods description of the transformer-based sentiment pipeline: no domain-specific fine-tuning, human validation set, accuracy metrics, or error analysis on Sri Lankan tourism English is reported. The 18.6% incongruence rate, six directional patterns, venue effects, and all SHAP-identified drivers are obtained by thresholding pipeline output against star ratings; without evidence that the pipeline is accurate on this domain, model error (e.g., sarcasm, politeness norms) cannot be ruled out as the source of detected mismatches.
- [Results (directional patterns)] Results section on directional patterns: the procedure for deriving the six directional patterns (including any sentiment-score thresholds or rules that define Conservative Rater, Obligatory 5-Star, etc.) is not specified, so it is impossible to assess reproducibility or sensitivity of the pattern frequencies.
- [Modeling and SHAP analysis] Modeling section (logistic regression and Random Forest): no performance metrics (accuracy, AUC, R², or cross-validation scores) are supplied for the models used to identify contributors to incongruence, weakening the claim that venue type, reviewer expertise, length, and temporal factors are reliable drivers.
minor comments (2)
- [Abstract] The abstract states the dataset size and time span but does not name the source platform (e.g., TripAdvisor) or any preprocessing steps applied to the 16,156 reviews.
- [Results] Figure or table presenting the six patterns would benefit from explicit threshold values or example reviews for each pattern to aid interpretability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting the original analysis.
- Referee: [Abstract and Methods (sentiment pipeline)] Abstract and the Methods description of the transformer-based sentiment pipeline: no domain-specific fine-tuning, human validation set, accuracy metrics, or error analysis on Sri Lankan tourism English is reported. The 18.6% incongruence rate, six directional patterns, venue effects, and all SHAP-identified drivers are obtained by thresholding pipeline output against star ratings; without evidence that the pipeline is accurate on this domain, model error (e.g., sarcasm, politeness norms) cannot be ruled out as the source of detected mismatches.
Authors: We agree the manuscript should report pipeline performance details for this domain. The analysis used an off-the-shelf pre-trained transformer sentiment model applied independently of ratings. While this follows common practice for large-scale observational studies, we acknowledge the limitation. In revision we will add a dedicated validation subsection reporting accuracy on a held-out general-domain test set, a brief error analysis on a random sample of 200 Sri Lankan reviews (including examples of potential sarcasm or politeness issues), and an explicit discussion of how model error could affect incongruence estimates. This will allow readers to assess the robustness of the 18.6% figure. revision: yes
- Referee: [Results (directional patterns)] Results section on directional patterns: the procedure for deriving the six directional patterns (including any sentiment-score thresholds or rules that define Conservative Rater, Obligatory 5-Star, etc.) is not specified, so it is impossible to assess reproducibility or sensitivity of the pattern frequencies.
Authors: The omission of explicit derivation rules was an oversight. The six patterns are obtained by crossing the pipeline's three-class sentiment output (positive/negative/neutral, thresholded at model probability >0.6) with binned star ratings (low:1-2, mid:3, high:4-5). Conservative Rater is defined as negative/neutral text with low/mid rating; Obligatory 5-Star as positive text with high rating but short length or low expertise. We will insert a new Methods subsection with the exact rules, probability thresholds, and a sensitivity table showing how pattern frequencies change under alternative thresholds (e.g., 0.5 vs 0.7). revision: yes
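The crossing of three sentiment classes with three rating bins can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' rules: it assumes a review is congruent when low maps to negative, mid to neutral, and high to positive, which leaves six of the nine cells as mismatches. The pattern names (Conservative Rater and so on) are deliberately not assigned to cells, and the 0.6 probability threshold is not modeled; the function takes an already-decided sentiment label as input. All identifiers are hypothetical.

```python
from collections import Counter

# Rating bins as stated in the response: low 1-2, mid 3, high 4-5.
RATING_BINS = {1: "low", 2: "low", 3: "mid", 4: "high", 5: "high"}

# Assumption: the text sentiment that counts as matching each rating bin.
EXPECTED_SENTIMENT = {"low": "negative", "mid": "neutral", "high": "positive"}

def classify(stars, text_sentiment):
    """Return (rating_bin, text_sentiment, is_incongruent) for one review."""
    if text_sentiment not in EXPECTED_SENTIMENT.values():
        raise ValueError(f"unknown sentiment label: {text_sentiment!r}")
    rating_bin = RATING_BINS[stars]
    return rating_bin, text_sentiment, EXPECTED_SENTIMENT[rating_bin] != text_sentiment

def incongruence_summary(reviews):
    """reviews: iterable of (stars, text_sentiment) pairs.

    Returns the overall mismatch rate and a count per mismatching cell.
    """
    patterns = Counter()
    total = 0
    for stars, sentiment in reviews:
        total += 1
        rating_bin, sentiment, incongruent = classify(stars, sentiment)
        if incongruent:
            patterns[(rating_bin, sentiment)] += 1
    rate = sum(patterns.values()) / total if total else 0.0
    return rate, patterns

# Enumerate the nine cells and keep the mismatching ones.
ALL_CELLS = [(b, s) for b in ("low", "mid", "high")
             for s in ("negative", "neutral", "positive")]
INCONGRUENT_CELLS = [c for c in ALL_CELLS if EXPECTED_SENTIMENT[c[0]] != c[1]]

# Toy reviews, made up for illustration only.
toy_rate, toy_patterns = incongruence_summary(
    [(5, "positive"), (5, "negative"), (3, "positive"), (1, "negative")]
)
```

Keeping the cell as a (rating bin, sentiment) pair rather than a named pattern makes it easy to rerun the count under alternative thresholds, since only the upstream sentiment labels change.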
- Referee: [Modeling and SHAP analysis] Modeling section (logistic regression and Random Forest): no performance metrics (accuracy, AUC, R², or cross-validation scores) are supplied for the models used to identify contributors to incongruence, weakening the claim that venue type, reviewer expertise, length, and temporal factors are reliable drivers.
Authors: We will add the requested metrics in the revised Modeling section. Logistic regression results will include AUC (via 5-fold CV) and McFadden's pseudo-R²; Random Forest will report out-of-bag accuracy, AUC, and permutation importance stability across 10 CV folds. These additions will directly support the reliability of the identified drivers (venue type, expertise, length, temporal factors) while preserving the original SHAP interpretations. revision: yes
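For readers unfamiliar with two of the metrics named here, the sketch below computes McFadden's pseudo-R² and AUC from a list of binary outcomes and predicted probabilities, using only the standard library. It is a hedged illustration of the definitions, not the authors' evaluation code: the function names and toy numbers are hypothetical, and cross-validation, out-of-bag scoring, and permutation importance are not implemented.

```python
import math

def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def mcfadden_r2(y_true, p_pred):
    """McFadden's pseudo-R²: 1 - LL(model) / LL(intercept-only model)."""
    base_rate = sum(y_true) / len(y_true)
    if base_rate in (0.0, 1.0):
        raise ValueError("need both classes to define the null model")
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1 - ll_model / ll_null

def auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a sketch but slow on large datasets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy values, made up for illustration: 1 = incongruent, 0 = congruent.
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_y, toy_p)
toy_r2 = mcfadden_r2(toy_y, toy_p)
```

A model that predicts only the base rate scores a pseudo-R² of 0 and an AUC of 0.5, which gives a baseline against which any reported values can be read.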
Circularity Check
Empirical observational study with no derivations or self-referential predictions
The paper is a purely empirical observational study that applies a transformer-based sentiment pipeline to a fixed dataset of 16,156 reviews, computes an 18.6% incongruence rate, identifies directional patterns, and runs standard statistical tests plus ML models (logistic regression, Random Forest, SHAP). No equations, fitted parameters presented as predictions, self-citation load-bearing premises, or ansatzes that reduce to inputs by construction appear in the provided text. The central claim rests on direct comparison of two independently obtained signals (pipeline output vs. star ratings) without any reduction of the reported quantities to the measurement process itself. This matches the default expectation of a non-circular empirical analysis.