arxiv: 2605.07811 · v1 · submitted 2026-05-08 · 💻 cs.CL

A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches for Sentiment Classification on IMDb Movie Reviews

Pith reviewed 2026-05-11 03:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentiment classification · IMDb movie reviews · support vector machine · BiLSTM · TF-IDF · classical machine learning · deep learning comparison

The pith

Classical machine learning with SVM reaches 85.3 percent accuracy on IMDb reviews and beats the tested BiLSTM models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a direct comparison of classical machine learning and deep learning for labeling IMDb movie reviews as positive or negative. It applies TF-IDF features plus automated selection to logistic regression, naive Bayes, and SVM, then pits those results against BiLSTM and BiLSTM-with-attention networks. The classical pipeline wins, with SVM at 0.8530 accuracy while the best deep learning run reaches only 0.706. A reader would care because the outcome questions the default preference for neural networks when data and compute are moderate. The work treats classical methods as a competitive baseline rather than a fallback.

Core claim

Classical machine learning, especially SVM using TF-IDF features and PyCaret AutoML, achieves 0.8530 accuracy and outperforms both standard BiLSTM and BiLSTM with attention, which reaches 0.706 accuracy. The study shows that effective feature engineering lets simpler models capture sentiment patterns more reliably than the sequential deep learning setups under the given training conditions.

What carries the argument

The dual experimental pipeline that extracts TF-IDF features for automated classical model selection versus direct sequence modeling with bidirectional LSTMs and an attention layer.
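
As a rough illustration of the feature side of the classical pipeline, the sketch below computes textbook TF-IDF weights in plain Python. The paper's TF-IDF configuration is not reported here, so the weighting variant (relative term frequency times log(N/df)), the whitespace tokenizer, and the toy reviews are all assumptions for illustration, not the authors' setup.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            # Relative term frequency scaled by inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical toy reviews, not taken from the IMDb dataset.
reviews = [
    "a great movie with a great cast",
    "a dull movie",
    "great fun",
]
vectors = tfidf(reviews)
```

Under this variant a term that occurs in every document gets weight zero, while rarer terms such as "dull" receive larger weights. Vectors of this kind are what a linear classifier such as SVM would consume.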

If this is right

  • TF-IDF features paired with SVM deliver a strong, low-resource baseline for binary sentiment classification.
  • Attention layers measurably improve BiLSTM's handling of review context but still fall short of the classical result here.
  • Classical methods remain preferable when labeled data volume or compute budget is limited.
  • Deep learning models require additional optimization steps to close the gap on this task.
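
The attention layer mentioned in the list above can be sketched as a softmax-weighted pooling over per-token hidden states. This is a minimal illustration with a simplified dot-product scoring vector; the paper's actual attention formulation and dimensions are not reported here, so `hidden_states` and `query` are hypothetical values.

```python
import math

def attention_pool(hidden_states, query):
    """Softmax-weighted average of hidden states, scored against `query`."""
    # One scalar score per time step: dot product of state and query.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax over time steps.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context

# Hypothetical 3-step sequence of 2-dimensional BiLSTM outputs.
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]
weights, context = attention_pool(hidden_states, query)
```

The pooled `context` vector replaces the plain final hidden state as input to the classifier, so tokens with higher scores contribute more to the prediction.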

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners might first test TF-IDF plus classical models on new review datasets before committing to neural architectures.
  • The gap could shrink on larger or more diverse corpora where sequential patterns become more decisive.
  • Hybrid designs that inject TF-IDF signals into neural networks might combine the strengths observed in each pipeline.

Load-bearing premise

The performance difference arises mainly from model type rather than from unequal hyperparameter tuning, training time, or hidden dataset properties between the two approaches.

What would settle it

Retraining the BiLSTM and BiLSTM-with-attention models with more extensive hyperparameter search, longer training, or added regularization and obtaining accuracy above 0.8530 on the same IMDb split would show the classical advantage was not general.

Figures

Figures reproduced from arXiv: 2605.07811 by Ardika Satria, Citra Agustin, Erma Daniar Safitri, Lia Hana Ichisasmita, Luluk Muthoharoh, Martin Clinton Tosima Manullang.

Figure 1
Figure 1: Distribution of sentiment labels. Basic preprocessing steps include text normalization such as converting text to lowercase, removing noise (e.g., punctuation, URLs, and HTML tags), and whitespace normalization. These steps are essential to improve data quality and reduce irrelevant information Fitroh [2023]. For the machine learning (ML) pipeline, the cleaned text is transformed into numerical features usi…
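
The preprocessing steps listed in the Figure 1 excerpt can be sketched as follows. The regular expressions and the order of operations are assumptions; the paper's exact cleaning rules are not shown in the excerpt.

```python
import re

def clean_review(text):
    """Lowercase, strip HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags such as <br />
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    return re.sub(r"\s+", " ", text).strip()            # whitespace normalization

# Hypothetical input string, not taken from the dataset.
cleaned = clean_review("Great film!<br /><br />See https://example.com  NOW...")
```

Note that this simple punctuation rule splits contractions such as "don't" into two tokens, which may or may not match the paper's handling.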
Figure 2
Figure 2: Accuracy comparison of all models. The accuracy comparison illustrates the relative performance differences between models. While deep learning models show competitive performance, classical machine learning models remain strong contenders, particularly when combined with effective feature engineering techniques such as TF-IDF.
Figure 3
Figure 3: Model ranking based on accuracy. The ranking visualization highlights the performance hierarchy among models. It can be observed that the top-performing models achieve similar accuracy levels, suggesting that both ML and DL approaches are capable of achieving competitive results under certain conditions.
Figure 4
Figure 4: Comparison of multiple evaluation metrics.
Original abstract

This paper presents a comparative study of classical machine learning and deep learning methods for sentiment classification on the IMDb movie reviews dataset. The machine learning pipeline uses TF-IDF features and PyCaret AutoML to evaluate Logistic Regression, Naïve Bayes, and Support Vector Machine, while the deep learning pipeline implements BiLSTM and BiLSTM with an attention mechanism. Experimental results show that classical machine learning, especially SVM, achieves the best performance with an accuracy of 0.8530, outperforming the deep learning models in this study. The BiLSTM with Attention model improves over the standard BiLSTM and reaches an accuracy of 0.706, indicating better contextual modeling. The paper concludes that although deep learning can capture sequential dependencies, classical machine learning remains a strong baseline when combined with effective feature engineering such as TF-IDF, particularly under limited data and computational resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical comparison of classical machine learning (Logistic Regression, Naive Bayes, and SVM using TF-IDF features with PyCaret AutoML) versus deep learning (BiLSTM and BiLSTM with attention) for binary sentiment classification on the IMDb movie reviews dataset. It reports that SVM achieves the highest accuracy of 0.8530 while the best deep learning model, BiLSTM with attention, reaches only 0.706, and it concludes that classical methods with effective feature engineering remain strong baselines under limited resources.

Significance. If the deep learning models had received optimization effort comparable to the AutoML-tuned classical pipeline, the result would usefully illustrate that TF-IDF-based classical models can outperform basic recurrent networks on this task. The explicit use of PyCaret for the classical side is a strength for reproducibility of that portion of the comparison.

major comments (3)
  1. [Deep Learning Models subsection] The reported BiLSTM accuracy of 0.706 is atypically low for the IMDb dataset (standard configurations with 128-dim embeddings, Adam optimizer, 5–10 epochs and dropout routinely exceed 0.85). Without any reported details on optimizer, batch size, number of epochs, embedding initialization, learning-rate schedule, early stopping, or hyperparameter search budget, the performance gap cannot be attributed to model class rather than unequal optimization effort.
  2. [Results section] (Table 1 or equivalent) No error bars, multiple random seeds, or statistical significance tests are provided for the accuracy differences; the claim that SVM 'outperforms' the deep models therefore rests on single-point estimates whose reliability cannot be assessed.
  3. [Experimental Setup] The manuscript provides no information on the train-test split (standard 25k/25k or otherwise), vocabulary size, or preprocessing pipeline applied to the deep learning models, preventing direct comparison of feature engineering effort between the two paradigms.
minor comments (2)
  1. [Abstract] The LaTeX rendering of 'Naïve Bayes' is inconsistent with plain-text usage elsewhere; standardize formatting.
  2. [Conclusion] The statement that deep learning 'can capture sequential dependencies' is not supported by any qualitative analysis or error-case examination in the current results.
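
Major comment 2 can be made concrete with a short calculation. The sketch below computes a normal-approximation confidence interval for each reported accuracy and a mean and standard deviation across seeds. The test-set size is an assumption (the standard 25,000-review IMDb test split; the paper's actual split is not reported), and the per-seed accuracies are hypothetical placeholders, not results from the paper.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy on n examples."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * se, acc + z * se

def seed_summary(accuracies):
    """Mean and sample standard deviation across runs with different seeds."""
    mean = sum(accuracies) / len(accuracies)
    var = sum((a - mean) ** 2 for a in accuracies) / (len(accuracies) - 1)
    return mean, math.sqrt(var)

# ASSUMPTION: n_test = 25_000 is the standard IMDb test split; the paper
# does not report the split it actually used.
n_test = 25_000
svm_low, svm_high = accuracy_interval(0.8530, n_test)
att_low, att_high = accuracy_interval(0.706, n_test)

# Hypothetical per-seed accuracies, for illustration only (not paper results).
mean_acc, std_acc = seed_summary([0.70, 0.71, 0.72])
```

Under the assumed test size the two intervals would not overlap, so test-set sampling noise alone is unlikely to explain the gap. The interval does not capture seed-to-seed variance or unequal tuning effort, which is the referee's main concern, and a paired test such as McNemar's would need per-example predictions that are not available here.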

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [Deep Learning Models subsection] The reported BiLSTM accuracy of 0.706 is atypically low for the IMDb dataset (standard configurations with 128-dim embeddings, Adam optimizer, 5–10 epochs and dropout routinely exceed 0.85). Without any reported details on optimizer, batch size, number of epochs, embedding initialization, learning-rate schedule, early stopping, or hyperparameter search budget, the performance gap cannot be attributed to model class rather than unequal optimization effort.

    Authors: We appreciate the referee highlighting this issue. The BiLSTM and BiLSTM-with-attention models were implemented using basic, standard configurations without extensive hyperparameter tuning or optimization budgets, consistent with our stated focus on limited computational resources in the conclusion. This was intentional to contrast against the AutoML-optimized classical pipeline. We will revise the Deep Learning Models subsection to provide full implementation details on optimizer, batch size, epochs, embedding initialization, learning rate, dropout, and related settings to improve reproducibility and allow readers to evaluate the optimization effort. revision: yes

  2. Referee: [Results section] (Table 1 or equivalent) No error bars, multiple random seeds, or statistical significance tests are provided for the accuracy differences; the claim that SVM 'outperforms' the deep models therefore rests on single-point estimates whose reliability cannot be assessed.

    Authors: We agree that single-run accuracy figures limit the strength of the performance claims. We will update the Results section to report means and standard deviations from multiple random seeds and include statistical significance testing for the observed differences. revision: yes

  3. Referee: [Experimental Setup] The manuscript provides no information on the train-test split (standard 25k/25k or otherwise), vocabulary size, or preprocessing pipeline applied to the deep learning models, preventing direct comparison of feature engineering effort between the two paradigms.

    Authors: We apologize for the missing details. We will expand the Experimental Setup section to explicitly state the train-test split used, the vocabulary size, and the full preprocessing pipeline (including tokenization and padding) applied to the deep learning models. This will enable direct comparison of feature engineering between the classical and deep learning approaches. revision: yes
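
The tokenization and padding steps named in the third response can be sketched as below. The whitespace tokenizer, the reserved ids, the vocabulary size, and the sequence length are all placeholder assumptions; the paper's actual values are among the details the referee notes as missing.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens

def build_vocab(texts, max_size):
    """Map the most frequent tokens to integer ids, starting after PAD/UNK."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx + 2 for idx, tok in enumerate(most_common)}

def encode(text, vocab, max_len):
    """Convert text to a fixed-length id sequence (truncate, then right-pad)."""
    ids = [vocab.get(tok, UNK) for tok in text.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Hypothetical toy corpus; vocabulary size and sequence length are placeholders.
train_texts = ["great movie", "dull movie", "great great cast"]
vocab = build_vocab(train_texts, max_size=3)
encoded = encode("great unknown movie", vocab, max_len=5)
```

Both `max_size` and `max_len` change what the network sees, which is why reporting them matters for comparing feature engineering effort across the two pipelines.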

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with measured results

full rationale

The paper conducts a direct experimental comparison of classical ML (TF-IDF + PyCaret AutoML for LR, NB, SVM) and DL (BiLSTM, BiLSTM+Attention) models on the IMDb dataset, reporting observed accuracies such as SVM at 0.8530 and BiLSTM with attention at 0.706. There are no derivations, equations, fitted parameters renamed as predictions, self-citations used as load-bearing premises, or self-definitional steps. All performance figures are measurement outcomes from running the models, not quantities defined or forced by the paper's own assumptions or prior self-work. The work is self-contained against external benchmarks and contains no mathematical chain that reduces to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The claim rests on standard supervised learning assumptions and the representativeness of the IMDb dataset for sentiment tasks. No new entities are introduced. Free parameters are the usual model hyperparameters and feature-engineering choices optimized during training.

free parameters (2)
  • TF-IDF configuration (max features, n-gram range, etc.)
    Selected during feature extraction and likely tuned by PyCaret AutoML.
  • SVM and BiLSTM hyperparameters
    Optimized via AutoML or training procedure; values not reported in abstract.
axioms (2)
  • domain assumption IMDb reviews are representative of general English-language sentiment classification problems.
    Implicit in treating the dataset as a sufficient benchmark.
  • standard math Standard i.i.d. sampling and supervised learning assumptions hold for the experimental setup.
    Underlying all reported accuracy measurements.

pith-pipeline@v0.9.0 · 5474 in / 1366 out tokens · 54290 ms · 2026-05-11T03:05:19.199556+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. N. G. Ramadhan and T. I. Ramadhan. Analysis sentiment based on IMDb aspects from movie reviews using SVM. Sinkron Journal, 2022.

  2. D. Subedi et al. Sentiment analysis of IMDb movie reviews using SVM and naive Bayes classifier. Journal of Engineering and Sciences, 2025.

  3. M. Yasen and S. Tedmori. Movie reviews sentiment analysis and classification. International Journal of Computer Applications, 2019.

  4. N. C. Dang et al. Sentiment analysis based on deep learning: A comparative study. Electronics, 9(3), 2020.

  5. S. Singh and N. Singla. Sentiment analysis on IMDb review dataset. International Journal of Engineering Research and Technology, 2023.

  6. R. Gupta et al. Advancements in sentiment analysis using deep learning. IEEE Access, 2024.

  7. A. Walji et al. Comparative study of machine learning and deep learning in NLP. Journal of AI Research, 2025.

  8. Z. Dong et al. A survey on sentiment analysis. IEEE Transactions, 2020.

  9. F. Fitroh. Sentiment analysis using machine learning approaches. Journal of Informatics, 2023.

  10. I. Tarimer et al. Text classification using machine learning and deep learning. International Journal of Data Science, 2024.

  11. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

  12. David M. W. Powers. Journal of Machine Learning Technologies.

  13. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

  14. Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.

  15. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.