A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches for Sentiment Classification on IMDb Movie Reviews
Pith reviewed 2026-05-11 03:05 UTC · model grok-4.3
The pith
Classical machine learning with SVM reaches 85.3 percent accuracy on IMDb reviews and beats the tested BiLSTM models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Classical machine learning, especially SVM using TF-IDF features and PyCaret AutoML, achieves 0.8530 accuracy and outperforms both the standard BiLSTM and the BiLSTM with attention; the latter is the stronger of the two deep models at 0.706 accuracy. The study shows that effective feature engineering lets simpler models capture sentiment patterns more reliably than the sequential deep learning setups under the given training conditions.
What carries the argument
A dual experimental pipeline: one branch extracts TF-IDF features and uses AutoML to select among classical models, while the other models token sequences directly with bidirectional LSTMs, with and without an attention layer.
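To make the classical branch concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes the textbook form tf × log(N / df) with whitespace tokenization; the paper's actual TF-IDF configuration is not reported, and library implementations typically add smoothing and normalization. The function name `tfidf` and the three toy reviews are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    # ASSUMPTION: lowercase + whitespace split; the paper's tokenizer is unknown.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy corpus (hypothetical, not drawn from IMDb).
reviews = [
    "a great film with great acting",
    "a dull film with weak acting",
    "great fun",
]
vectors = tfidf(reviews)
```

Terms that appear in few documents, such as "dull", receive a higher weight than terms shared across documents, such as "film". A linear classifier like SVM can exploit that property of the sparse vectors.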
If this is right
- TF-IDF features paired with SVM deliver a strong, low-resource baseline for binary sentiment classification.
- Attention layers measurably improve BiLSTM's handling of review context but still fall short of the classical result here.
- Classical methods remain preferable when labeled data volume or compute budget is limited.
- Deep learning models require additional optimization steps to close the gap on this task.
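The attention bullet above can be illustrated with a small sketch of attention pooling over per-token hidden states. It assumes additive scoring (score = v · tanh(W h)) followed by a softmax over time steps; the material here does not describe which attention variant the paper uses, so `attention_pool`, `w`, `v`, and the toy hidden states are all hypothetical.

```python
import math

def attention_pool(hidden_states, w, v):
    """Additive attention pooling: score_t = v . tanh(W h_t), softmax over t."""
    scores = []
    for h in hidden_states:
        projected = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)))
            for row in w
        ]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, projected)))
    # Numerically stable softmax over time steps.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(a * h[d] for a, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context

# Toy inputs (hypothetical): three time steps, two hidden dimensions.
hidden_states = [[0.1, -0.2], [0.9, 0.4], [0.0, 0.1]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability
v = [1.0, 1.0]
weights, context = attention_pool(hidden_states, w, v)
```

In place of using only the final hidden state, the classifier receives `context`, which lets informative tokens anywhere in a long review contribute directly to the prediction.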
Where Pith is reading between the lines
- Practitioners might first test TF-IDF plus classical models on new review datasets before committing to neural architectures.
- The gap could shrink on larger or more diverse corpora where sequential patterns become more decisive.
- Hybrid designs that inject TF-IDF signals into neural networks might combine the strengths observed in each pipeline.
Load-bearing premise
The performance difference arises mainly from model type rather than from unequal hyperparameter tuning, training time, or hidden dataset properties between the two approaches.
What would settle it
Retraining the BiLSTM and BiLSTM-with-attention models with more extensive hyperparameter search, longer training, or added regularization and obtaining accuracy above 0.8530 on the same IMDb split would show the classical advantage was not general.
Original abstract
This paper presents a comparative study of classical machine learning and deep learning methods for sentiment classification on the IMDb movie reviews dataset. The machine learning pipeline uses TF-IDF features and PyCaret AutoML to evaluate Logistic Regression, Naïve Bayes, and Support Vector Machine, while the deep learning pipeline implements BiLSTM and BiLSTM with an attention mechanism. Experimental results show that classical machine learning, especially SVM, achieves the best performance with an accuracy of 0.8530, outperforming the deep learning models in this study. The BiLSTM with Attention model improves over the standard BiLSTM and reaches an accuracy of 0.706, indicating better contextual modeling. The paper concludes that although deep learning can capture sequential dependencies, classical machine learning remains a strong baseline when combined with effective feature engineering such as TF-IDF, particularly under limited data and computational resources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical comparison of classical machine learning (Logistic Regression, Naive Bayes, and SVM using TF-IDF features with PyCaret AutoML) versus deep learning (BiLSTM and BiLSTM with attention) for binary sentiment classification on the IMDb movie reviews dataset. It reports that SVM achieves the highest accuracy of 0.8530, while the better of the two deep learning models, BiLSTM with attention, reaches only 0.706. It concludes that classical methods with effective feature engineering remain strong baselines under limited resources.
Significance. If the deep learning models had received optimization effort comparable to the AutoML-tuned classical pipeline, the result would usefully illustrate that TF-IDF-based classical models can outperform basic recurrent networks on this task. The explicit use of PyCaret for the classical side is a strength for reproducibility of that portion of the comparison.
major comments (3)
- [Deep Learning Models subsection] Deep Learning Models subsection: The best reported BiLSTM accuracy, 0.706 with attention, is atypically low for the IMDb dataset (standard configurations with 128-dim embeddings, Adam optimizer, 5–10 epochs and dropout routinely exceed 0.85). Without any reported details on optimizer, batch size, number of epochs, embedding initialization, learning-rate schedule, early stopping, or hyperparameter search budget, the performance gap cannot be attributed to model class rather than unequal optimization effort.
- [Results section] Results section (Table 1 or equivalent): No error bars, multiple random seeds, or statistical significance tests are provided for the accuracy differences; the claim that SVM 'outperforms' the deep models therefore rests on single-point estimates whose reliability cannot be assessed.
- [Experimental Setup] Experimental Setup: The manuscript provides no information on the train-test split (standard 25k/25k or otherwise), vocabulary size, or preprocessing pipeline applied to the deep learning models, preventing direct comparison of feature engineering effort between the two paradigms.
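As a rough illustration of the single-point-estimate concern in the second comment, the sketch below computes Wilson score intervals for the two reported accuracies. The test-set size is an assumption: the manuscript does not state its split, so `N_TEST = 25_000` simply takes the standard IMDb test set as a stand-in. These intervals reflect test-set sampling noise only and say nothing about variance across random seeds.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed on n test examples."""
    denom = 1 + z**2 / n
    centre = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(
        accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)
    ) / denom
    return centre - half, centre + half

# ASSUMPTION: standard 25k IMDb test split; the paper does not report its split.
N_TEST = 25_000

svm_low, svm_high = wilson_interval(0.8530, N_TEST)  # reported SVM accuracy
att_low, att_high = wilson_interval(0.706, N_TEST)   # reported BiLSTM+attention accuracy
```

Under that assumed test-set size the two intervals do not overlap, so sampling noise alone would not explain the gap. Whether the gap survives equal tuning effort is the separate question raised in the first comment.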
minor comments (2)
- [Abstract] Abstract: The LaTeX rendering of 'Naïve Bayes' is inconsistent with plain-text usage elsewhere; standardize formatting.
- [Conclusion] Conclusion: The statement that deep learning 'can capture sequential dependencies' is not supported by any qualitative analysis or error-case examination in the current results.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.
Point-by-point responses
- Referee: [Deep Learning Models subsection] Deep Learning Models subsection: The best reported BiLSTM accuracy, 0.706 with attention, is atypically low for the IMDb dataset (standard configurations with 128-dim embeddings, Adam optimizer, 5–10 epochs and dropout routinely exceed 0.85). Without any reported details on optimizer, batch size, number of epochs, embedding initialization, learning-rate schedule, early stopping, or hyperparameter search budget, the performance gap cannot be attributed to model class rather than unequal optimization effort.
  Authors: We appreciate the referee highlighting this issue. The BiLSTM and BiLSTM-with-attention models were implemented using basic, standard configurations without extensive hyperparameter tuning or optimization budgets, consistent with our stated focus on limited computational resources in the conclusion. This was intentional to contrast against the AutoML-optimized classical pipeline. We will revise the Deep Learning Models subsection to provide full implementation details on optimizer, batch size, epochs, embedding initialization, learning rate, dropout, and related settings to improve reproducibility and allow readers to evaluate the optimization effort.
  revision: yes
- Referee: [Results section] Results section (Table 1 or equivalent): No error bars, multiple random seeds, or statistical significance tests are provided for the accuracy differences; the claim that SVM 'outperforms' the deep models therefore rests on single-point estimates whose reliability cannot be assessed.
  Authors: We agree that single-run accuracy figures limit the strength of the performance claims. We will update the Results section to report means and standard deviations from multiple random seeds and include statistical significance testing for the observed differences.
  revision: yes
- Referee: [Experimental Setup] Experimental Setup: The manuscript provides no information on the train-test split (standard 25k/25k or otherwise), vocabulary size, or preprocessing pipeline applied to the deep learning models, preventing direct comparison of feature engineering effort between the two paradigms.
  Authors: We apologize for the missing details. We will expand the Experimental Setup section to explicitly state the train-test split used, the vocabulary size, and the full preprocessing pipeline (including tokenization and padding) applied to the deep learning models. This will enable direct comparison of feature engineering between the classical and deep learning approaches.
  revision: yes
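The multi-seed reporting promised in the second response could be summarized along the following lines. This is a sketch only: the helper names `summarize` and `welch_t` are hypothetical, Welch's t statistic is one possible choice of test rather than the authors' stated method, and the per-seed numbers are placeholders invented to exercise the code, not results from the paper.

```python
import math
import statistics

def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

# PLACEHOLDER values for illustration only; NOT taken from the paper.
placeholder_model_a = [0.80, 0.82, 0.84]
placeholder_model_b = [0.70, 0.71, 0.72]

mean_a, sd_a = summarize(placeholder_model_a)
t_stat = welch_t(placeholder_model_a, placeholder_model_b)
```

Reporting a mean and standard deviation per model, plus a test statistic for the difference, would let readers judge whether an observed gap exceeds run-to-run variation.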
Circularity Check
No circularity: purely empirical comparison with measured results
full rationale
The paper conducts a direct experimental comparison of classical ML (TF-IDF + PyCaret AutoML for LR, NB, SVM) and DL (BiLSTM, BiLSTM+Attention) models on the IMDb dataset, reporting observed accuracies such as SVM at 0.8530 and BiLSTM+Attention at 0.706. There are no derivations, equations, fitted parameters renamed as predictions, self-citations used as load-bearing premises, or self-definitional steps. All performance figures are measurement outcomes from running the models, not quantities defined or forced by the paper's own assumptions or prior self-work. The work is self-contained against external benchmarks and contains no mathematical chain that reduces to its inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- TF-IDF configuration (max features, n-gram range, etc.)
- SVM and BiLSTM hyperparameters
axioms (2)
- domain assumption: IMDb reviews are representative of general English-language sentiment classification problems.
- standard math: Standard i.i.d. sampling and supervised learning assumptions hold for the experimental setup.