A Comparative Analysis of Machine Learning and Deep Learning Models for Tweet Sentiment Classification: A Case Study on the Sentiment140 Dataset

Ardika Satria; Bastian; Cintya Bella; Luluk Muthoharoh; Martin C.T. Manullang; Vita Anggraini

arxiv: 2605.04888 · v1 · submitted 2026-05-06 · 💻 cs.CL

Pith reviewed 2026-05-08 16:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords tweet sentiment classification, logistic regression, BiLSTM, Sentiment140 dataset, TF-IDF features, machine learning, deep learning, text classification

The pith

Logistic regression with TF-IDF features reached 73.5 percent accuracy on tweet sentiment classification, outperforming BiLSTM at 69.17 percent on a 10,000-tweet subset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares a logistic regression model that relies on TF-IDF vectorization against a bidirectional LSTM neural network for labeling the positive or negative sentiment of tweets. On a 10,000-tweet slice of the Sentiment140 collection the simpler model produced higher accuracy and avoided the mild overfitting that appeared in the deep learning version. The result matters for anyone building real-time social media monitors, because it shows that careful feature engineering can remain competitive with more elaborate architectures when the data volume stays moderate. The authors further packaged the models into a Streamlit web application hosted on Hugging Face Spaces. A reader who accepts the comparison may therefore treat classical methods as a practical first choice for informal short-text tasks of this scale rather than defaulting to deep learning.

Core claim

The study establishes that logistic regression using TF-IDF features outperforms a BiLSTM architecture by achieving 73.5 percent accuracy compared with 69.17 percent on a 10,000-tweet subset of the Sentiment140 dataset, while the deep learning model exhibits mild overfitting. The authors conclude that classical machine learning combined with robust feature extraction can outperform more complex deep learning approaches when the data consists of medium-scale informal text.

What carries the argument

The head-to-head experimental comparison of a TF-IDF logistic regression classifier against a bidirectional LSTM network for binary sentiment labeling on the same fixed tweet subset.
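
To make the classical side of this comparison concrete, here is a minimal sketch of a TF-IDF plus logistic regression pipeline in scikit-learn. The toy sentences, the 0/1 label encoding, and the default vectorizer and classifier settings are assumptions made for illustration; they are not the authors' preprocessing or hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for tweets; 1 = positive, 0 = negative (assumed encoding).
texts = [
    "i love this phone",
    "great day so happy",
    "love the sunshine happy",
    "what a great movie",
    "i hate this phone",
    "terrible day so sad",
    "hate the rain sad",
    "what a terrible movie",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each text into a sparse weighted bag of words;
# logistic regression then learns one weight per vocabulary term.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predictions = model.predict(["love happy great", "hate sad terrible"])
print(predictions)
```

Because every learned weight maps to a single term, a model of this kind can be inspected directly, which is part of its practical appeal at this data scale.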

If this is right

For informal text datasets of comparable size, classical machine learning with TF-IDF can deliver higher accuracy than bidirectional LSTM models.
Bidirectional LSTM networks applied to tweet data of this scale are prone to mild overfitting.
Trained sentiment classifiers can be packaged into interactive web applications for immediate public use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern might reverse on substantially larger tweet collections once deep learning models receive enough examples to overcome overfitting.
Practitioners building lightweight real-time sentiment dashboards may prefer the logistic regression route for lower compute cost and easier interpretability.
Repeating the experiment with multi-class labels or non-English tweets would clarify whether the advantage of the simpler model generalizes beyond binary English sentiment.

Load-bearing premise

The 10,000-tweet subset and the particular model configurations supply a fair and representative test of classical versus deep learning performance on informal social media text.

What would settle it

Retraining both models on the full Sentiment140 collection or on a larger random subset while reporting hyperparameter search and cross-validation results, then checking whether BiLSTM accuracy exceeds 73.5 percent without increased overfitting.
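
The cross-validation part of such a check could look roughly like the sketch below, which scores a TF-IDF logistic regression pipeline with stratified 5-fold cross-validation in scikit-learn. The toy corpus, the fold count, and the random seed are hypothetical choices for the example, not settings reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 20 positive and 20 negative short texts.
positive_words = ["love", "great", "happy", "awesome", "fun"]
negative_words = ["hate", "terrible", "sad", "awful", "boring"]
topics = ["monday", "coffee", "traffic", "movie"]

texts = [f"{word} {topic}" for word in positive_words + negative_words for topic in topics]
labels = [1] * 20 + [0] * 20  # first 20 texts are positive, last 20 negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Stratification keeps the positive/negative ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print(f"fold accuracies: {scores}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the mean and spread across folds, rather than one number from one partition, is what would let a reader judge whether a gap of a few percentage points is stable.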

Figures

Figures reproduced from arXiv: 2605.04888 by Ardika Satria, Bastian, Cintya Bella, Luluk Muthoharoh, Martin C.T. Manullang, Vita Anggraini.

**Figure 1.** Confusion Matrix for the Logistic Regression (Baseline) Model.

**Figure 2.** Confusion Matrix for the BiLSTM Model.

**Figure 3.** Comparative Analysis of Learning Curves: (Left) Machine Learning Generalization, (Right) Deep Learning

Original abstract

The exponential growth of social media has created an urgent need for automated systems to analyze unstructured public sentiment in real time. This study compares a traditional Logistic Regression model using TF-IDF features with a deep learning Bidirectional Long Short-Term Memory (BiLSTM) architecture on a 10,000-tweet subset of the Sentiment140 dataset. Experimental results show that Logistic Regression outperformed BiLSTM, achieving an accuracy of 73.5% compared with 69.17%, while the deep learning model exhibited mild overfitting. These findings suggest that for medium-scale informal text data, classical machine learning with robust feature extraction can outperform more complex deep learning approaches. Finally, the trained models were integrated into an interactive web application using Streamlit and deployed on Hugging Face Spaces for public access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript compares a Logistic Regression classifier using TF-IDF features against a BiLSTM model for binary sentiment classification on a 10,000-tweet subset of the Sentiment140 dataset. It reports that Logistic Regression achieves 73.5% accuracy while BiLSTM reaches 69.17%, with the latter exhibiting mild overfitting, and concludes that classical ML approaches can outperform deep learning for medium-scale informal text. The trained models are deployed in a Streamlit application hosted on Hugging Face Spaces.

Significance. If the experimental comparison were shown to be fair and reproducible, the result would provide a useful counter-example to the assumption that deep learning is invariably superior for tweet-level sentiment analysis on datasets of this size. It would strengthen the case for evaluating simple feature-based baselines before adopting more complex architectures, particularly in resource-constrained or real-time settings. The deployment component adds modest practical value but does not alter the core empirical claim.

major comments (4)

[Experimental Results] Experimental Results section: The reported accuracies (73.5% for LR, 69.17% for BiLSTM) are presented without any description of the train/test split, the size of the training set, whether stratified sampling was used, or any form of cross-validation. This omission prevents verification that the performance gap is not an artifact of a single fixed partition.
[Model Architectures and Training] BiLSTM model description and training procedure: No information is supplied on the BiLSTM architecture (number of layers, hidden dimension, embedding size), regularization (dropout, weight decay), optimizer, learning-rate schedule, batch size, or number of epochs. The claim of 'mild overfitting' is therefore unsupported by any training/validation curves or quantitative evidence.
[Experimental Results] Statistical comparison: The manuscript asserts that Logistic Regression outperformed BiLSTM but provides no statistical test (e.g., McNemar’s test, bootstrap confidence intervals, or paired t-test across multiple runs) to establish whether the 4.33 percentage-point difference is significant or could arise from random variation in a single run.
[Model Architectures and Training] Hyperparameter selection: The paper states that BiLSTM exhibited mild overfitting yet reports no hyperparameter search (grid, random, or Bayesian) or even default values used for the deep model. Without this, the conclusion that 'classical machine learning can outperform deep learning' rests on an untested assumption that the chosen BiLSTM configuration was near-optimal.
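
The gap in question is 73.5 − 69.17 = 4.33 percentage points. As a sketch of the kind of paired test the third comment asks for, the following computes an exact two-sided McNemar test from two models' predictions on the same test items, using only the Python standard library. The labels and predictions are invented for illustration; the paper's per-example predictions are not available here.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A got right
    and c counts items only model B got right.
    """
    b = sum(1 for t, a, other in zip(y_true, pred_a, pred_b) if a == t and other != t)
    c = sum(1 for t, a, other in zip(y_true, pred_a, pred_b) if a != t and other == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under the null hypothesis, discordant items split like a fair coin.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical labels and predictions for ten test items.
y_true      = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_lr     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
pred_bilstm = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

b, c, p_value = mcnemar_exact(y_true, pred_lr, pred_bilstm)
print(f"only LR correct: {b}, only BiLSTM correct: {c}, p = {p_value:.3f}")
```

The test looks only at items where exactly one model is correct, so it needs the two models to be evaluated on an identical test set.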

minor comments (2)

[Abstract and Conclusion] The abstract and conclusion refer to 'medium-scale informal text data' without defining the scale threshold or citing prior work that establishes when DL typically begins to outperform TF-IDF baselines.
[Data Preprocessing] Preprocessing steps (tokenization, stop-word removal, handling of URLs/mentions) are mentioned only in passing; a concise table or paragraph listing exact steps would improve reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We are grateful to the referee for the insightful comments that highlight important aspects of experimental rigor. We have carefully considered each major comment and will revise the manuscript to enhance reproducibility and strengthen the empirical claims. Below we provide point-by-point responses.

Referee: Experimental Results section: The reported accuracies (73.5% for LR, 69.17% for BiLSTM) are presented without any description of the train/test split, the size of the training set, whether stratified sampling was used, or any form of cross-validation. This omission prevents verification that the performance gap is not an artifact of a single fixed partition.

Authors: We agree that these details are essential for reproducibility. The original manuscript omitted them inadvertently. In the revised version, we will describe the data partitioning process in full, including the train/test split ratio and sizes, confirmation of stratified sampling to maintain class distribution, and a note that cross-validation was not performed. We will also discuss the rationale and potential limitations of using a single split.
revision: yes

Referee: BiLSTM model description and training procedure: No information is supplied on the BiLSTM architecture (number of layers, hidden dimension, embedding size), regularization (dropout, weight decay), optimizer, learning-rate schedule, batch size, or number of epochs. The claim of 'mild overfitting' is therefore unsupported by any training/validation curves or quantitative evidence.

Authors: We acknowledge this gap in the model description. The revised manuscript will include a complete specification of the BiLSTM architecture, including the number of layers, hidden dimensions, embedding size, dropout rates, optimizer choice, learning rate, batch size, and number of training epochs. Furthermore, we will add plots of training and validation loss/accuracy curves to provide quantitative support for the mild overfitting observation.
revision: yes

Referee: Statistical comparison: The manuscript asserts that Logistic Regression outperformed BiLSTM but provides no statistical test (e.g., McNemar’s test, bootstrap confidence intervals, or paired t-test across multiple runs) to establish whether the 4.33 percentage-point difference is significant or could arise from random variation in a single run.

Authors: We concur that statistical significance testing would bolster the comparison. Since the current results are from single runs, we will augment the experiments by conducting multiple independent runs with different random seeds and report mean accuracies with standard deviations. Alternatively, we will compute bootstrap confidence intervals for the accuracy difference on the test set. This analysis will be incorporated into the revised Experimental Results section.
revision: yes

Referee: Hyperparameter selection: The paper states that BiLSTM exhibited mild overfitting yet reports no hyperparameter search (grid, random, or Bayesian) or even default values used for the deep model. Without this, the conclusion that 'classical machine learning can outperform deep learning' rests on an untested assumption that the chosen BiLSTM configuration was near-optimal.

Authors: We appreciate this critique. The BiLSTM was implemented using standard library defaults without a dedicated hyperparameter optimization procedure, which we will now explicitly document in the revised paper. We will also clarify that our intent was to compare a straightforward deep learning baseline against a well-tuned classical model, rather than claiming superiority over an optimally tuned deep model. The conclusion will be nuanced to reflect this, emphasizing the value of simple baselines even when deep models are not extensively tuned.
revision: partial
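
One way the bootstrap confidence interval promised in the third response might be computed is sketched below: a paired percentile bootstrap over test items, in standard-library Python. The function name, the resample count, the seed, and the ten-item data are hypothetical and serve only to show the mechanics.

```python
import random

def bootstrap_accuracy_gap(y_true, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B) on paired items.

    Returns (observed_gap, lower, upper).
    """
    rng = random.Random(seed)
    n = len(y_true)
    # Per-item difference in correctness: +1, 0, or -1.
    diffs = [int(a == t) - int(b == t) for t, a, b in zip(y_true, pred_a, pred_b)]
    observed = sum(diffs) / n

    gaps = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()

    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper

# Hypothetical labels and predictions for ten test items.
y_true      = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_lr     = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
pred_bilstm = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

observed, lower, upper = bootstrap_accuracy_gap(y_true, pred_lr, pred_bilstm)
print(f"observed gap = {observed:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero would support the claim that the gap is not a single-split artifact, though it captures only test-set sampling noise and not variation across training seeds.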

Circularity Check

0 steps flagged

Empirical comparison with no derivations or self-referential claims

The paper is a straightforward experimental study comparing Logistic Regression (with TF-IDF) and BiLSTM on a 10k-tweet subset of Sentiment140. It reports observed accuracies (73.5% vs 69.17%) and notes mild overfitting in the DL model, but contains no equations, no first-principles derivations, no fitted parameters renamed as predictions, and no load-bearing self-citations. The central claim rests on direct benchmark results against an external dataset and is therefore self-contained with no reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is a purely empirical machine-learning comparison paper. It relies on standard supervised learning assumptions but introduces no free parameters, new axioms, or invented entities beyond the two off-the-shelf models and the TF-IDF representation.

axioms (1)

domain assumption Tweets in the Sentiment140 subset are independent and identically distributed for training and evaluation
Implicit in any train/test split for classification performance reporting

pith-pipeline@v0.9.0 · 5455 in / 1140 out tokens · 27321 ms · 2026-05-08T16:45:41.477609+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Anil Rajput. Natural Language Processing, Sentiment Analysis and Clinical Analytics. Effat University Publication, 2020.

[2] M. N. Muttaqin and I. Kharisudin. Analisis sentimen aplikasi gojek menggunakan support vector machine dan k-nearest neighbor. UNNES Journal of Mathematics, 10(1):22–27, 2021.

[3] Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. Semeval-2016 task 4: Sentiment analysis in twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18, 2016.

[4] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[5] H. Hendriyana, I. M. Karo Karo, and S. Dewi. Analisis perbandingan algoritma support vector machine, naive bayes, dan regresi logistik untuk memprediksi donor darah. Jurnal Teknologi Terpadu, 8(2):121–126, 2022.

[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[7] D. R. Alghifari, M. Edi, and L. Firmansyah. Implementasi bidirectional lstm untuk analisis sentimen terhadap layanan grab indonesia. Jurnal Manajemen Informatika (JAMIKA), 12(2):89–100, 2022.

[8] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.

[9] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12), 2009.

[10] S. Elmaliyasari, M. A. Alzam, N. A. Pratiwi, S. S. M. Wara, and K. M. Hindrayani. Deteksi sentimen komentar aplikasi gobis suroboyo dengan metode naive bayes dan metode regresi logistik. JDMIS: Journal of Data Mining and Information Systems, 3(2):108–116, 2025.

[11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, et al. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12:2825–2830, 2011.

[12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019.
