From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification
Pith reviewed 2026-05-22 06:41 UTC · model grok-4.3
The pith
RoBERTa achieves 93.02 percent accuracy in classifying positive and negative movie reviews on the IMDb dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper evaluates Naive Bayes, Logistic Regression, Support Vector Machines, LightGBM, LSTM, RoBERTa, and DistilBERT on the IMDb dataset for binary sentiment classification. RoBERTa performed better than all the other models, with an accuracy of 93.02 percent. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
What carries the argument
Head-to-head comparison of classical statistical and deep-learning models against transformer models, followed by soft-voting ensemble aggregation for final sentiment predictions.
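The soft-voting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each model outputs a probability that a review is positive, that all models are weighted equally, and that the decision threshold is 0.5. The paper does not state its weights or threshold, and the function name `soft_vote` and the probabilities below are hypothetical.

```python
from typing import Dict, List


def soft_vote(prob_by_model: Dict[str, List[float]], threshold: float = 0.5) -> List[int]:
    """Average each model's P(positive) per review, then threshold the mean.

    Equal weighting and the 0.5 threshold are assumptions made for this sketch.
    """
    models = list(prob_by_model)
    if not models:
        raise ValueError("at least one model is required")
    n = len(prob_by_model[models[0]])
    if any(len(p) != n for p in prob_by_model.values()):
        raise ValueError("all models must score the same reviews")

    labels = []
    for i in range(n):
        mean_p = sum(prob_by_model[m][i] for m in models) / len(models)
        labels.append(1 if mean_p >= threshold else 0)  # 1 = positive, 0 = negative
    return labels


# Hypothetical probabilities for three reviews (illustrative, not from the paper).
probs = {
    "naive_bayes": [0.60, 0.40, 0.20],
    "logistic_regression": [0.70, 0.45, 0.30],
    "roberta": [0.95, 0.80, 0.10],
}
predictions = soft_vote(probs)
print(predictions)
```

For the second review, two of the three hypothetical models lean negative, yet the averaged probability is above the threshold because one model is highly confident. This is the way soft voting differs from a majority (hard) vote.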
If this is right
- RoBERTa can serve as a strong default model for high-accuracy binary sentiment tasks on review-style text.
- Soft-voting ensembles that mix classical and transformer models can improve classification performance, although the abstract does not quantify the gain over RoBERTa alone.
- Transformer architectures capture contextual cues in reviews more effectively than count-based or tree-based methods.
- The performance ordering among models provides a practical benchmark for choosing tools in production review analysis.
Where Pith is reading between the lines
- Similar comparative tests could reveal whether the same ordering appears in non-English review data or in shorter social-media texts.
- Production systems might start with RoBERTa and add ensemble layers only when marginal gains justify the extra compute.
- The gap between statistical models and transformers suggests that pre-training on large corpora reduces reliance on hand-crafted features for this task.
- If the pattern holds across domains, organizations could standardize on transformer-plus-ensemble pipelines for opinion mining rather than maintaining multiple separate classifiers.
Load-bearing premise
That the IMDb train/test split and its positive/negative labels are representative of real-world sentiment classification tasks, and that standard preprocessing does not favor transformer models over simpler ones.
What would settle it
Applying the identical set of models and the same preprocessing pipeline to a different binary sentiment dataset, such as product reviews or social media posts, and finding that RoBERTa no longer records the highest accuracy or that the ensemble shows no gain.
Original abstract
Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentimental analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for the ML models to understand the context or metaphysical sentiment accurately, as ML models rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After a lot of testing with accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts an empirical comparison of machine learning models for binary sentiment classification on the IMDb movie review dataset. It evaluates Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, DistilBERT, and RoBERTa, reporting that RoBERTa achieves the highest accuracy of 93.02% along with strong precision, recall, F1, and ROC-AUC scores. A soft-voting ensemble combining all models is shown to yield further performance gains.
Significance. If the performance ordering is shown to be robust to equivalent hyperparameter optimization across models, the work would provide a useful, multi-metric benchmark illustrating the practical advantage of transformer architectures over classical and LSTM baselines for sentiment analysis, as well as the benefit of simple ensembling. The use of a standard public dataset and reporting of multiple evaluation metrics (accuracy, precision, recall, F1, ROC-AUC) are positive features that support reproducibility.
major comments (2)
- [Experimental Setup / Methods] The experimental setup section provides no description of hyperparameter search procedures, validation-set usage for model selection, learning-rate schedules, batch sizes, epoch counts, or tuning budget allocated to each model. This is load-bearing for the central claim that RoBERTa (93.02% accuracy) outperforms the others, because transformer fine-tuning typically involves more degrees of freedom than classical models; without explicit controls the observed gap could reflect optimization disparity rather than model capability.
- [Results] The results section reports point estimates on the standard IMDb split but includes no statistical significance tests (e.g., McNemar or bootstrap confidence intervals) comparing RoBERTa to the next-best model, nor any ablation on whether the soft-voting ensemble improvement is statistically reliable. This weakens the strength of the headline comparison and ensemble claim.
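The McNemar test named in the second major comment can be sketched as below. This is a hedged illustration of the exact (binomial) form of the test on paired predictions from two classifiers. The function name `mcnemar_exact` and the toy labels are hypothetical, and nothing here reproduces the paper's data or results.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(y_true: Sequence[int],
                  pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on the discordant pairs of two classifiers.

    Returns (b, c, p_value), where
      b = examples model A classifies correctly and model B does not,
      c = examples model B classifies correctly and model A does not.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("inputs must have equal length")

    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy labels and predictions (illustrative only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]  # one error
pred_b = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]  # six errors
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Only the examples on which the two models disagree in correctness enter the test, which is why it suits a comparison on a single fixed test split. In the toy data, six discordant pairs favor model A and one favors model B, and the exact two-sided p-value is 0.125.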
minor comments (2)
- [Abstract] The abstract states that 'a lot of testing' was performed but does not quantify the ensemble's improvement (e.g., the exact accuracy or F1 gain over RoBERTa alone).
- [Data Preprocessing] Preprocessing details (tokenization, TF-IDF vectorizer parameters, maximum sequence length for transformers) are mentioned only at a high level; adding a short table or paragraph would improve clarity.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments on our empirical comparison of machine learning models for sentiment classification. We respond to each major comment below, indicating the revisions we will make to address the concerns.
- Referee: [Experimental Setup / Methods] The experimental setup section provides no description of hyperparameter search procedures, validation-set usage for model selection, learning-rate schedules, batch sizes, epoch counts, or tuning budget allocated to each model. This is load-bearing for the central claim that RoBERTa (93.02% accuracy) outperforms the others, because transformer fine-tuning typically involves more degrees of freedom than classical models; without explicit controls the observed gap could reflect optimization disparity rather than model capability.
  Authors: We agree that the original manuscript omitted key details on hyperparameter procedures, which limits the ability to fully evaluate the fairness of the model comparisons. In the revised version, we will expand the experimental setup section to explicitly describe the procedures used: grid or random search with cross-validation on the training portion for classical models (Naive Bayes, Logistic Regression, SVM, LightGBM), and the specific fine-tuning settings (learning rates, batch sizes, epochs, and validation monitoring) applied to LSTM, DistilBERT, and RoBERTa following standard library recommendations. This addition will clarify the optimization approach taken and support the reported performance ordering.
  revision: yes
- Referee: [Results] The results section reports point estimates on the standard IMDb split but includes no statistical significance tests (e.g., McNemar or bootstrap confidence intervals) comparing RoBERTa to the next-best model, nor any ablation on whether the soft-voting ensemble improvement is statistically reliable. This weakens the strength of the headline comparison and ensemble claim.
  Authors: We concur that statistical tests and ensemble analysis would strengthen the results. In the revision, we will augment the results section with bootstrap confidence intervals around the reported metrics and McNemar's test to evaluate whether RoBERTa's gains over the next-best model are statistically significant. We will also add a brief ablation or comparative analysis quantifying the soft-voting ensemble's improvement relative to the best individual model, including measures of reliability. These elements will be computed on the existing splits and included in the updated manuscript.
  revision: yes
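The bootstrap confidence intervals promised in the rebuttal could take a form like the sketch below. It is a percentile bootstrap for accuracy on a fixed test set, offered only as an illustration. The function name `bootstrap_accuracy_ci`, the resample count, the seed, and the toy labels are all assumptions, and the numbers it prints are not results from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(y_true: Sequence[int],
                          y_pred: Sequence[int],
                          n_resamples: int = 1000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for accuracy.

    Returns (accuracy, lower, upper) for a (1 - alpha) confidence level.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # 1 where the prediction matches the label, 0 otherwise.
    correct = [1 if t == p else 0 for t, p in zip(y_true, y_pred)]
    n = len(correct)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    stats = []
    for _ in range(n_resamples):
        # Resample test examples with replacement and recompute accuracy.
        hits = sum(correct[rng.randrange(n)] for _ in range(n))
        stats.append(hits / n)
    stats.sort()

    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


# Toy test set: 200 examples, 14 of them misclassified (illustrative only).
y_true = [1] * 100 + [0] * 100
y_pred = [0] * 7 + [1] * 93 + [1] * 7 + [0] * 93
accuracy, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
print(accuracy, ci_low, ci_high)
```

The same resampling loop can be reused for precision, recall, or F1 by replacing the per-resample statistic. To compare two models, one would resample example indices once per iteration and compute the difference between the two models' scores on that shared resample.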
Circularity Check
No circularity: purely empirical model comparison on held-out labels
The paper reports direct accuracy measurements (e.g., RoBERTa at 93.02%) from running standard classifiers and transformers on the fixed IMDb train/test split and comparing predictions to ground-truth labels. No equations, derivations, fitted parameters renamed as predictions, or self-citations are used to generate the headline result; performance numbers are external measurements against an independent benchmark. The analysis contains no self-definitional steps, uniqueness theorems, or ansatzes that reduce outputs to inputs by construction. This is a standard empirical comparison and remains self-contained against external dataset labels.
Axiom & Free-Parameter Ledger
free parameters (1)
- model-specific hyperparameters
axioms (2)
- domain assumption: IMDb labels accurately reflect binary sentiment
- domain assumption: Standard train/test split is unbiased