A Comparative Analysis of Machine Learning and Deep Learning Models for Tweet Sentiment Classification: A Case Study on the Sentiment140 Dataset
Pith reviewed 2026-05-08 16:45 UTC · model grok-4.3
The pith
Logistic regression with TF-IDF features reached 73.5 percent accuracy on tweet sentiment classification, outperforming BiLSTM at 69.17 percent on a 10,000-tweet subset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study reports that logistic regression using TF-IDF features outperforms a BiLSTM architecture, achieving 73.5 percent accuracy compared with 69.17 percent on a 10,000-tweet subset of the Sentiment140 dataset, while the deep learning model exhibits mild overfitting. The authors conclude that classical machine learning combined with robust feature extraction can outperform more complex deep learning approaches when the data consists of medium-scale informal text.
What carries the argument
The head-to-head experimental comparison of a TF-IDF logistic regression classifier against a bidirectional LSTM network for binary sentiment labeling on the same fixed tweet subset.
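To make the feature side of that comparison concrete, here is a minimal sketch of TF-IDF weighting in plain Python. The function name `tfidf_vectors`, the whitespace tokenizer, the toy tweets, and the smoothed IDF variant are all assumptions chosen for illustration; the paper does not state which TF-IDF settings it used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised.

    Weighting (an assumed variant): raw term count times smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in counts.items()
        }
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical toy tweets, not taken from Sentiment140.
toy_tweets = [
    "i love this phone",
    "i hate this traffic",
    "i love sunny days",
]
vectors = tfidf_vectors(toy_tweets)
```

In this toy corpus, a term that appears in only one tweet ("phone") receives a larger weight than one that appears in every tweet ("i"), which is the property a linear classifier such as logistic regression relies on.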
If this is right
- For informal text datasets of comparable size, classical machine learning with TF-IDF can deliver higher accuracy than bidirectional LSTM models.
- Bidirectional LSTM networks applied to tweet data of this scale are prone to mild overfitting.
- Trained sentiment classifiers can be packaged into interactive web applications for immediate public use.
Where Pith is reading between the lines
- The same pattern might reverse on substantially larger tweet collections once deep learning models receive enough examples to overcome overfitting.
- Practitioners building lightweight real-time sentiment dashboards may prefer the logistic regression route for lower compute cost and easier interpretability.
- Repeating the experiment with multi-class labels or non-English tweets would clarify whether the advantage of the simpler model generalizes beyond binary English sentiment.
Load-bearing premise
The 10,000-tweet subset and the particular model configurations supply a fair and representative test of classical versus deep learning performance on informal social media text.
What would settle it
Retraining both models on the full Sentiment140 collection or on a larger random subset while reporting hyperparameter search and cross-validation results, then checking whether BiLSTM accuracy exceeds 73.5 percent without increased overfitting.
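One way to set up the cross-validation part of such a rerun is a stratified k-fold partition, sketched below in plain Python. The function name `stratified_kfold`, the fold count, the seed, and the balanced toy labels are assumptions for illustration; nothing here reflects the split the authors actually used.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class ratios in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets an equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold
        )
        yield train_idx, test_idx

# Hypothetical balanced labels: 0 = negative, 1 = positive.
labels = [0] * 50 + [1] * 50
splits = list(stratified_kfold(labels, k=5, seed=42))
```

Training both models on each `train_idx` and scoring them on the matching `test_idx` would give one accuracy pair per fold, so the gap between models can be reported with its fold-to-fold spread rather than as a single number.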
Original abstract
The exponential growth of social media has created an urgent need for automated systems to analyze unstructured public sentiment in real time. This study compares a traditional Logistic Regression model using TF-IDF features with a deep learning Bidirectional Long Short-Term Memory (BiLSTM) architecture on a 10,000-tweet subset of the Sentiment140 dataset. Experimental results show that Logistic Regression outperformed BiLSTM, achieving an accuracy of 73.5% compared with 69.17%, while the deep learning model exhibited mild overfitting. These findings suggest that for medium-scale informal text data, classical machine learning with robust feature extraction can outperform more complex deep learning approaches. Finally, the trained models were integrated into an interactive web application using Streamlit and deployed on Hugging Face Spaces for public access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares a Logistic Regression classifier using TF-IDF features against a BiLSTM model for binary sentiment classification on a 10,000-tweet subset of the Sentiment140 dataset. It reports that Logistic Regression achieves 73.5% accuracy while BiLSTM reaches 69.17%, with the latter exhibiting mild overfitting, and concludes that classical ML approaches can outperform deep learning for medium-scale informal text. The trained models are deployed in a Streamlit application hosted on Hugging Face Spaces.
Significance. If the experimental comparison were shown to be fair and reproducible, the result would provide a useful counter-example to the assumption that deep learning is invariably superior for tweet-level sentiment analysis on datasets of this size. It would strengthen the case for evaluating simple feature-based baselines before adopting more complex architectures, particularly in resource-constrained or real-time settings. The deployment component adds modest practical value but does not alter the core empirical claim.
major comments (4)
- [Experimental Results] Experimental Results section: The reported accuracies (73.5% for LR, 69.17% for BiLSTM) are presented without any description of the train/test split, the size of the training set, whether stratified sampling was used, or any form of cross-validation. This omission prevents verification that the performance gap is not an artifact of a single fixed partition.
- [Model Architectures and Training] BiLSTM model description and training procedure: No information is supplied on the BiLSTM architecture (number of layers, hidden dimension, embedding size), regularization (dropout, weight decay), optimizer, learning-rate schedule, batch size, or number of epochs. The claim of 'mild overfitting' is therefore unsupported by any training/validation curves or quantitative evidence.
- [Experimental Results] Statistical comparison: The manuscript asserts that Logistic Regression outperformed BiLSTM but provides no statistical test (e.g., McNemar’s test, bootstrap confidence intervals, or paired t-test across multiple runs) to establish whether the 4.33 percentage-point difference is significant or could arise from random variation in a single run.
- [Model Architectures and Training] Hyperparameter selection: The paper states that BiLSTM exhibited mild overfitting yet reports no hyperparameter search (grid, random, or Bayesian) or even default values used for the deep model. Without this, the conclusion that 'classical machine learning can outperform deep learning' rests on an untested assumption that the chosen BiLSTM configuration was near-optimal.
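As an illustration of the paired test named in the third comment, the sketch below implements the exact (binomial) form of McNemar's test in plain Python. The function name `mcnemar_exact` and the toy predictions are hypothetical; the paper's per-tweet predictions are not available, so no conclusion about its 4.33-point gap follows from this example.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value): b counts items only model A classified
    correctly, c counts items only model B classified correctly.
    """
    b = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa != y and pb == y)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under H0 the discordant pairs split like fair coin flips: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical toy labels and predictions for two models on the same test items.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred_lr = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred_bilstm = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
b, c, p_value = mcnemar_exact(y_true, pred_lr, pred_bilstm)
```

The test uses only the items on which the two models disagree, which is why it needs both models evaluated on the same test set and per-item predictions rather than two headline accuracies.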
minor comments (2)
- [Abstract and Conclusion] The abstract and conclusion refer to 'medium-scale informal text data' without defining the scale threshold or citing prior work that establishes when DL typically begins to outperform TF-IDF baselines.
- [Data Preprocessing] Preprocessing steps (tokenization, stop-word removal, handling of URLs/mentions) are mentioned only in passing; a concise table or paragraph listing exact steps would improve reproducibility.
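The kind of explicit step listing requested in the second minor comment could be as short as the following sketch. The function name `clean_tweet`, the placeholder tokens `url` and `user`, and the order of the steps are assumptions for illustration; the paper's actual preprocessing is described only in passing.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s']")

def clean_tweet(text):
    """Lowercase, mask URLs and mentions, drop punctuation, squeeze spaces."""
    text = text.lower()
    text = URL_RE.sub(" url ", text)       # replace links with a placeholder
    text = MENTION_RE.sub(" user ", text)  # replace @handles with a placeholder
    text = NON_WORD_RE.sub(" ", text)      # drops '#', '!', etc.; keeps apostrophes
    return " ".join(text.split())          # collapse repeated whitespace

example = clean_tweet("@Bob I LOVE this!! http://t.co/abc #happy")
```

Here `example` evaluates to `"user i love this url happy"`. Stop-word removal is left out on purpose, since whether it helps sentiment classification (negations such as "not" are often on stop-word lists) is itself a choice the paper would need to report.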
Simulated Author's Rebuttal
We are grateful to the referee for the insightful comments that highlight important aspects of experimental rigor. We have carefully considered each major comment and will revise the manuscript to enhance reproducibility and strengthen the empirical claims. Below we provide point-by-point responses.
- Referee: Experimental Results section: The reported accuracies (73.5% for LR, 69.17% for BiLSTM) are presented without any description of the train/test split, the size of the training set, whether stratified sampling was used, or any form of cross-validation. This omission prevents verification that the performance gap is not an artifact of a single fixed partition.
  Authors: We agree that these details are essential for reproducibility. The original manuscript omitted them inadvertently. In the revised version, we will describe the data partitioning process in full, including the train/test split ratio and sizes, confirm that stratified sampling was used to maintain the class distribution, and note that cross-validation was not performed. We will also discuss the rationale and potential limitations of using a single split.
  Revision: yes
- Referee: BiLSTM model description and training procedure: No information is supplied on the BiLSTM architecture (number of layers, hidden dimension, embedding size), regularization (dropout, weight decay), optimizer, learning-rate schedule, batch size, or number of epochs. The claim of 'mild overfitting' is therefore unsupported by any training/validation curves or quantitative evidence.
  Authors: We acknowledge this gap in the model description. The revised manuscript will include a complete specification of the BiLSTM architecture, including the number of layers, hidden dimensions, embedding size, dropout rates, optimizer choice, learning rate, batch size, and number of training epochs. Furthermore, we will add plots of training and validation loss/accuracy curves to provide quantitative support for the mild overfitting observation.
  Revision: yes
- Referee: Statistical comparison: The manuscript asserts that Logistic Regression outperformed BiLSTM but provides no statistical test (e.g., McNemar’s test, bootstrap confidence intervals, or paired t-test across multiple runs) to establish whether the 4.33 percentage-point difference is significant or could arise from random variation in a single run.
  Authors: We concur that statistical significance testing would bolster the comparison. Since the current results are from single runs, we will augment the experiments by conducting multiple independent runs with different random seeds and report mean accuracies with standard deviations. Alternatively, we will compute bootstrap confidence intervals for the accuracy difference on the test set. This analysis will be incorporated into the revised Experimental Results section.
  Revision: yes
- Referee: Hyperparameter selection: The paper states that BiLSTM exhibited mild overfitting yet reports no hyperparameter search (grid, random, or Bayesian) or even default values used for the deep model. Without this, the conclusion that 'classical machine learning can outperform deep learning' rests on an untested assumption that the chosen BiLSTM configuration was near-optimal.
  Authors: We appreciate this critique. The BiLSTM was implemented using standard library defaults without a dedicated hyperparameter optimization procedure, which we will now explicitly document in the revised paper. We will also clarify that our intent was to compare a straightforward deep learning baseline against a well-tuned classical model, rather than to claim superiority over an optimally tuned deep model. The conclusion will be qualified accordingly, emphasizing the value of simple baselines even when deep models are not extensively tuned.
  Revision: partial
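The bootstrap analysis the authors propose in their third response could look roughly like the sketch below, written in plain Python. The function name `bootstrap_accuracy_diff`, the number of resamples, the seed, and the toy predictions are assumptions for illustration; they are not the authors' procedure or data.

```python
import random

def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on a shared test set."""
    rng = random.Random(seed)
    n = len(y_true)
    # Per-item difference: +1 if only A is right, -1 if only B is right, else 0.
    deltas = [int(pa == y) - int(pb == y) for y, pa, pb in zip(y_true, pred_a, pred_b)]
    observed = sum(deltas) / n
    stats = []
    for _ in range(n_boot):
        # Resample test items with replacement, keeping the A/B pairing intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return observed, lower, upper

# Hypothetical toy predictions: A is correct on 150 of 200 items, B on 140.
y_true = [1] * 200
pred_a = [1] * 150 + [0] * 50
pred_b = [1] * 140 + [0] * 60
observed, lower, upper = bootstrap_accuracy_diff(y_true, pred_a, pred_b)
```

Resampling paired per-item outcomes, rather than each model's accuracy separately, keeps the correlation between the two models' errors, which usually narrows the interval for the difference. An interval that excludes zero would support a real gap on that test set, though it would not address seed-to-seed variation in BiLSTM training, which is what the multiple-run analysis covers.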
Circularity Check
Empirical comparison with no derivations or self-referential claims
The paper is a straightforward experimental study comparing Logistic Regression (with TF-IDF) and BiLSTM on a 10k-tweet subset of Sentiment140. It reports observed accuracies (73.5% vs 69.17%) and notes mild overfitting in the DL model, but contains no equations, no first-principles derivations, no fitted parameters renamed as predictions, and no load-bearing self-citations. The central claim rests on direct benchmark results against an external dataset and is therefore self-contained with no reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tweets in the Sentiment140 subset are independent and identically distributed for training and evaluation