arxiv: 2605.22003 · v1 · pith:VDLUFYTXnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

Pith reviewed 2026-05-22 06:41 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords sentiment analysis · IMDb dataset · RoBERTa · transformer models · ensemble learning · movie reviews · text classification · natural language processing

The pith

RoBERTa achieves 93.02 percent accuracy in classifying positive and negative movie reviews on the IMDb dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates a range of models from basic statistical classifiers to transformer architectures to sort movie reviews into positive or negative categories. It uses the standard IMDb dataset after typical text preprocessing steps and measures success with accuracy, precision, recall, F1-score, and ROC-AUC. The central effort is to identify which approach handles the contextual and statistical aspects of sentiment most effectively. A sympathetic reader would care because reliable automated sentiment tools can process large volumes of customer or audience feedback without manual review. Results show that RoBERTa leads the individual models and that combining them in a soft-voting ensemble lifts performance further.
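
The five metrics named above can be illustrated with a minimal, standard-library sketch. The function names `binary_metrics` and `roc_auc` and the label convention (1 = positive, 0 = negative) are assumptions made for illustration; the paper does not say how its metrics were computed.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against empty denominators (e.g. no predicted positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a small illustration but slow for a 25,000-review test set.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]        # toy positive-class scores
    preds = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5
    print(binary_metrics(truth, preds))
    print(roc_auc(truth, scores))
```

Accuracy, precision, recall and F1 depend on the chosen decision threshold, whereas ROC-AUC is computed from the raw scores, which is one reason the two families of metrics can rank models differently.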

Core claim

The paper evaluates Naive Bayes, Logistic Regression, Support Vector Machines, LightGBM, LSTM, RoBERTa, and DistilBERT on the IMDb dataset for binary sentiment classification. RoBERTa performed better than all the other models, with an accuracy of 93.02 percent. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.

What carries the argument

Head-to-head comparison of classical statistical and deep-learning models against transformer models, followed by soft-voting ensemble aggregation for final sentiment predictions.
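
Soft voting itself is simple enough to sketch in a few lines. The version below is a minimal illustration, assuming each model outputs a probability for the positive class and that all models are weighted equally unless weights are supplied; the function names, the 0.5 threshold, and the toy probabilities are hypothetical, and the paper's actual weighting scheme is not stated on this page.

```python
from typing import List, Optional, Sequence


def soft_vote_proba(model_probs: Sequence[Sequence[float]],
                    weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted average of positive-class probabilities across models.

    model_probs[m][i] is model m's probability that review i is positive.
    """
    n_models = len(model_probs)
    if n_models == 0:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0] * n_models  # assumption: equal weights by default
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_items = len(model_probs[0])
    if any(len(p) != n_items for p in model_probs):
        raise ValueError("all models must score the same reviews")
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_items)
    ]


def soft_vote(model_probs: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None,
              threshold: float = 0.5) -> List[int]:
    """Label a review positive (1) when the averaged probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in soft_vote_proba(model_probs, weights)]


if __name__ == "__main__":
    # Three hypothetical models scoring three reviews.
    model_a = [0.90, 0.55, 0.20]
    model_b = [0.80, 0.55, 0.10]
    model_c = [0.70, 0.10, 0.60]
    print(soft_vote([model_a, model_b, model_c]))
```

In the toy data, two of the three models put the second review just above 0.5, so a hard majority vote would label it positive, while the averaged probability (0.4) labels it negative. That sensitivity to confidence is the difference between soft and hard voting.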

If this is right

  • RoBERTa can serve as a strong default model for high-accuracy binary sentiment tasks on review-style text.
  • Soft-voting ensembles that mix classical and transformer models produce better results than any single model alone.
  • Transformer architectures capture contextual cues in reviews more effectively than count-based or tree-based methods.
  • The performance ordering among models provides a practical benchmark for choosing tools in production review analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar comparative tests could reveal whether the same ordering appears in non-English review data or in shorter social-media texts.
  • Production systems might start with RoBERTa and add ensemble layers only when marginal gains justify the extra compute.
  • The gap between statistical models and transformers suggests that pre-training on large corpora reduces reliance on hand-crafted features for this task.
  • If the pattern holds across domains, organizations could standardize on transformer-plus-ensemble pipelines for opinion mining rather than maintaining multiple separate classifiers.

Load-bearing premise

The paper assumes that the IMDb train/test split and its positive/negative labels are representative of real-world sentiment classification tasks, and that standard preprocessing does not favor transformer models over simpler ones.

What would settle it

Applying the identical set of models and the same preprocessing pipeline to a different binary sentiment dataset, such as product reviews or social media posts, and finding that RoBERTa no longer records the highest accuracy or that the ensemble shows no gain.

Figures

Figures reproduced from arXiv: 2605.22003 by Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy.

Figure 1. Soft Voting Ensemble Architecture.
III. DATASET AND PREPROCESSING, A. Dataset: The widely recognized IMDb dataset [5], which comprises 50,000 movie reviews, was used. The reviews were split evenly between training and test sets. The dataset is balanced, with 25,000 reviews labelled positive and 25,000 labelled negative, and has no missing or mismatched entries. The sheer size of the dataset ma…
Figure 2. Evaluation of Model Performance Metrics: Accuracy, F1-Score, and ROC-AUC.
Figure 3. Confusion Matrices for Logistic Regression and Naïve Bayes models.
Figure 4. Confusion Matrices for LightGBM and SVM models.
Figure 5. Confusion matrix of LSTM model.
Figure 6. Confusion Matrices for DistilBERT and RoBERTa models.
Figure 7. SHAP feature importance for LightGBM and LSTM models.
Figure 8. Attention visualization for DistilBERT model.
Figure 9. Attention visualization for RoBERTa model.
Complex Negations: Double negation and cross-clausal negation were difficult to classify and were often misclassified. For example, the phrase “not without its advantage” was often categorized as negative despite expressing positive sentiment. Sarcasm and Irony: Models demonstrated limited capabilities when it came to detecting…
read the original abstract

Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentiment analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for the ML models to understand the context or metaphysical sentiment accurately, as ML models rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After a lot of testing with accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts an empirical comparison of machine learning models for binary sentiment classification on the IMDb movie review dataset. It evaluates Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, DistilBERT, and RoBERTa, reporting that RoBERTa achieves the highest accuracy of 93.02% along with strong precision, recall, F1, and ROC-AUC scores. A soft-voting ensemble combining all models is shown to yield further performance gains.

Significance. If the performance ordering is shown to be robust to equivalent hyperparameter optimization across models, the work would provide a useful, multi-metric benchmark illustrating the practical advantage of transformer architectures over classical and LSTM baselines for sentiment analysis, as well as the benefit of simple ensembling. The use of a standard public dataset and reporting of multiple evaluation metrics (accuracy, precision, recall, F1, ROC-AUC) are positive features that support reproducibility.

major comments (2)
  1. [Experimental Setup / Methods] The experimental setup section provides no description of hyperparameter search procedures, validation-set usage for model selection, learning-rate schedules, batch sizes, epoch counts, or tuning budget allocated to each model. This is load-bearing for the central claim that RoBERTa (93.02% accuracy) outperforms the others, because transformer fine-tuning typically involves more degrees of freedom than classical models; without explicit controls the observed gap could reflect optimization disparity rather than model capability.
  2. [Results] The results section reports point estimates on the standard IMDb split but includes no statistical significance tests (e.g., McNemar or bootstrap confidence intervals) comparing RoBERTa to the next-best model, nor any ablation on whether the soft-voting ensemble improvement is statistically reliable. This weakens the strength of the headline comparison and ensemble claim.
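
The paired significance check named in major comment 2 could look like the following minimal sketch of an exact McNemar test. It assumes both models are scored on the same test items and uses the exact binomial form of the test rather than the chi-square approximation; the function name `mcnemar_exact` and the toy predictions are hypothetical and are not taken from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(y_true: Sequence[int],
                  pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = items model A gets right and model B gets wrong,
      c = items model A gets wrong and model B gets right.
    Items where both models agree in correctness carry no information
    about which model is better and are ignored.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    # Under H0 each discordant item favours A or B with probability 1/2,
    # so b ~ Binomial(n, 0.5). Double the smaller tail for a two-sided p-value.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    truth  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    pred_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # hypothetical stronger model
    pred_b = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1]  # hypothetical weaker model
    print(mcnemar_exact(truth, pred_a, pred_b))
```

Because the test only looks at items where exactly one model is correct, a gap of a fraction of a percentage point on 25,000 test reviews can still be significant or not depending on how the disagreements are distributed, which is why point estimates alone do not settle the comparison.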
minor comments (2)
  1. [Abstract] The abstract states that 'a lot of testing' was performed but does not quantify the ensemble's improvement (e.g., the exact accuracy or F1 gain over RoBERTa alone).
  2. [Data Preprocessing] Preprocessing details (tokenization, TF-IDF vectorizer parameters, maximum sequence length for transformers) are mentioned only at a high level; adding a short table or paragraph would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on our empirical comparison of machine learning models for sentiment classification. We respond to each major comment below, indicating the revisions we will make to address the concerns.

read point-by-point responses
  1. Referee: [Experimental Setup / Methods] The experimental setup section provides no description of hyperparameter search procedures, validation-set usage for model selection, learning-rate schedules, batch sizes, epoch counts, or tuning budget allocated to each model. This is load-bearing for the central claim that RoBERTa (93.02% accuracy) outperforms the others, because transformer fine-tuning typically involves more degrees of freedom than classical models; without explicit controls the observed gap could reflect optimization disparity rather than model capability.

    Authors: We agree that the original manuscript omitted key details on hyperparameter procedures, which limits the ability to fully evaluate the fairness of the model comparisons. In the revised version, we will expand the experimental setup section to explicitly describe the procedures used: grid or random search with cross-validation on the training portion for classical models (Naive Bayes, Logistic Regression, SVM, LightGBM), and the specific fine-tuning settings (learning rates, batch sizes, epochs, and validation monitoring) applied to LSTM, DistilBERT, and RoBERTa following standard library recommendations. This addition will clarify the optimization approach taken and support the reported performance ordering. revision: yes

  2. Referee: [Results] The results section reports point estimates on the standard IMDb split but includes no statistical significance tests (e.g., McNemar or bootstrap confidence intervals) comparing RoBERTa to the next-best model, nor any ablation on whether the soft-voting ensemble improvement is statistically reliable. This weakens the strength of the headline comparison and ensemble claim.

    Authors: We concur that statistical tests and ensemble analysis would strengthen the results. In the revision, we will augment the results section with bootstrap confidence intervals around the reported metrics and McNemar's test to evaluate whether RoBERTa's gains over the next-best model are statistically significant. We will also add a brief ablation or comparative analysis quantifying the soft-voting ensemble's improvement relative to the best individual model, including measures of reliability. These elements will be computed on the existing splits and included in the updated manuscript. revision: yes
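
The bootstrap confidence intervals promised in the second response can be sketched as follows. This is a minimal percentile bootstrap for accuracy, assuming test items are resampled with replacement; the function name, the 2,000 resamples, the 95% level, and the fixed seed are illustrative choices, not settings reported by the authors.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(y_true: Sequence[int],
                          y_pred: Sequence[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Percentile-bootstrap confidence interval for accuracy.

    Returns (accuracy, lower, upper) for a (1 - alpha) interval.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one prediction")
    # Resample per-item correctness with replacement and recompute accuracy.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(correct) / n, stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    truth = [1] * 50 + [0] * 50
    preds = [0] * 10 + [1] * 40 + [0] * 50  # 90 of 100 toy predictions correct
    print(bootstrap_accuracy_ci(truth, preds))
```

When two models are compared, resampling the same item indices for both and bootstrapping the accuracy difference keeps the comparison paired, in the same spirit as McNemar's test.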

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on held-out labels

full rationale

The paper reports direct accuracy measurements (e.g., RoBERTa at 93.02%) from running standard classifiers and transformers on the fixed IMDb train/test split and comparing predictions to ground-truth labels. No equations, derivations, fitted parameters renamed as predictions, or self-citations are used to generate the headline result; performance numbers are external measurements against an independent benchmark. The analysis contains no self-definitional steps, uniqueness theorems, or ansatzes that reduce outputs to inputs by construction. This is a standard empirical comparison and remains self-contained against external dataset labels.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The evaluation rests on the assumption that standard tokenization, the public IMDb split, and off-the-shelf model implementations from libraries behave as documented in prior literature. No new entities or ad-hoc constants are introduced.

free parameters (1)
  • model-specific hyperparameters
    Learning rates, batch sizes, and fine-tuning epochs for each model are chosen but not enumerated; these are standard tunable values in transformer training.
axioms (2)
  • domain assumption IMDb labels accurately reflect binary sentiment
    The paper treats the provided positive/negative labels as ground truth without additional validation.
  • domain assumption Standard train/test split is unbiased
    Evaluation uses the conventional IMDb partition without reporting sensitivity to alternative splits.

pith-pipeline@v0.9.0 · 5754 in / 1243 out tokens · 41732 ms · 2026-05-22T06:41:05.391912+00:00 · methodology

Reference graph

Works this paper leans on

  1. A. F. AlShammari, “Implementation of keyword extraction using term frequency-inverse document frequency (tf-idf) in python,” International Journal of Computer Applications, vol. 185, no. 35, pp. 9–14. [Online]. Available: https://ijcaonline.org/archives/volume185/number35/32916-2023923137/
  2. A. M. Kibriya, E. Frank, B. Pfahringer, and G. Holmes, “Multinomial naive bayes for text categorization revisited,” in AI 2004: Advances in Artificial Intelligence, 2005, pp. 488–499.
  3. K. Kumar, B. S. Harish, and H. K. Darshan, “Sentiment analysis on imdb movie reviews using hybrid feature extraction method,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 5, pp. 109–114, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:59253424
  4. K. Topal and G. Ozsoyoglu, “Movie review analysis: Emotion analysis of imdb movie reviews,” in Proc. IEEE/ACM Int. Conf. Advances in Social Networks Analysis and Mining (ASONAM), 2016, pp. 1170–1176.
  5. L. N. Pathi, “Imdb dataset of 50k movie reviews,” https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews, 2018, accessed: Sep. 19, 2025.
  6. B. Das and S. Chakraborty, “An improved text sentiment classification model using tf-idf and next word negation,” arXiv, vol. abs/1806.06407. [Online]. Available: https://api.semanticscholar.org/CorpusID:49301351
  7. A. D. Bhargavi, “Comparative study of static and contextual text vectorization for sentiment analysis,” International Journal for Research in Applied Science and Engineering Technology, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:280216277
  8. G. Kaur, S. Haraldsson, and A. Bracciali, “Comparative analysis of transformer models for sentiment classification of uk cbdc discourse on x,” Discover Analytics, vol. 3, no. 1, p. 7, 2025. [Online]. Available: https://doi.org/10.1007/s44257-025-00035-4
  9. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” 2019. [Online]. Available: https://arxiv.org/abs/1907.11692
  10. S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 4768–4777.
  11. E. Kokalj, B. Škrlj, N. Lavrač, S. Pollak, and M. Robnik-Šikonja, “Bert meets shapley: Extending shap explanations to transformer-based classifiers,” in Proc. EACL Hackashop on News Media Content Analysis and Automated Report Generation, 2021, pp. 16–21. [Online]. Available: https://aclanthology.org/2021.hackashop-1.3/
  12. W. Khan, J. Ahmad, N. Alasbali, A. A. Mazroa, M. S. Alshehri, and M. S. Khan, “A novel transformer-based explainable ai approach using shap for intrusion detection in vehicular ad hoc networks,” Computer Networks, vol. 270, p. 111575, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1389128625005420
  13. P. Pookduang, R. Klangbunrueang, W. Chansanam, and T. Lunrasri, “Advancing sentiment analysis: Evaluating roberta against traditional and deep learning models,” Engineering, Technology & Applied Science Research, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:276099959
  14. M. Q. Huzyan Octava, D. G. Prasetyo Putri, F. M. Hilmy, U. Farooq, R. A. Nurhaliza, and G. Alfian, “Web-based sentiment analysis system using svm and tf-idf with statistical feature,” in Proc. Int. Conf. Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), 2023, pp. 9–14.
  15. M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust you?’: Explaining the predictions of any classifier,” in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD), 2016, pp. 1135– [Online]. Available: https://doi.org/10.1145/2939672.2939778