Benchmarking Logistic Regression, SVM, and LightGBM Against BiLSTM with Attention for Sentiment Analysis on Indonesian Product Reviews
Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3
The pith
Traditional machine learning classifiers can marginally outperform deep learning models for sentiment analysis on Indonesian product reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Logistic regression achieves the best performance among the evaluated machine learning approaches, with 97.26 percent accuracy and F1-score under 10-fold stratified cross-validation. That slightly exceeds the bidirectional long short-term memory model with attention, which reaches 97.24 percent accuracy and F1-score on the 3,946 held-out test samples of the Indonesian product review data. The work argues that traditional machine learning algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential deep learning architectures on high-dimensional datasets, while simultaneously offering greater computational efficiency.
What carries the argument
The direct side-by-side comparison of feature-engineered machine learning classifiers against an attention-augmented recurrent neural network for binary sentiment classification on product review text.
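As a rough illustration of the machine learning side of this comparison, the sketch below trains a bag-of-words logistic regression with plain gradient descent. It is a minimal sketch, not the paper's pipeline: the paper runs its models through PyCaret and does not state its feature representation, so the whitespace tokenizer, the count features, the learning rate, the helper names, and the four toy reviews are all assumptions made here for illustration.

```python
import math
from collections import Counter

# Made-up toy reviews for illustration only; they are NOT from the paper's dataset.
TRAIN = [
    ("bagus sekali", 1),   # "very good"
    ("barang bagus", 1),   # "good item"
    ("jelek sekali", 0),   # "very bad"
    ("barang jelek", 0),   # "bad item"
]


def bag_of_words(text):
    """Map a review to token counts (lowercased, whitespace tokenization)."""
    return Counter(text.lower().split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, text):
    """Probability that the review is positive under the linear model."""
    features = bag_of_words(text)
    z = bias + sum(weights.get(tok, 0.0) * count for tok, count in features.items())
    return sigmoid(z)


def train_logistic_regression(data, epochs=200, lr=0.5):
    """Full-batch gradient descent on the mean log loss (no regularization)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        grad_w, grad_b = Counter(), 0.0
        for text, label in data:
            error = predict_proba(weights, bias, text) - label
            for tok, count in bag_of_words(text).items():
                grad_w[tok] += error * count
            grad_b += error
        for tok, g in grad_w.items():
            weights[tok] = weights.get(tok, 0.0) - lr * g / len(data)
        bias -= lr * grad_b / len(data)
    return weights, bias


weights, bias = train_logistic_regression(TRAIN)
```

On this toy data the learned weight for "bagus" comes out positive and the weight for "jelek" negative, which is the kind of signal a linear model can read directly off word counts.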
If this is right
- For similar high-dimensional text classification tasks, simpler machine learning models can be selected to achieve high accuracy without the training overhead of deep networks.
- E-commerce platforms can implement sentiment monitoring with lower resource demands while maintaining near-equivalent performance levels.
- Thorough preprocessing and automated tuning enable conventional classifiers to remain viable alternatives to sequential neural architectures in production settings.
Where Pith is reading between the lines
- The narrow performance difference suggests that attention layers may contribute little when input features already capture strong signals from bag-of-words or similar representations.
- Repeating the benchmark on unbalanced review data or in other languages could test whether the preference for machine learning models generalizes beyond the balanced Indonesian binary case.
- This style of controlled comparison offers a template for deciding model families in other natural language processing applications where efficiency matters.
Load-bearing premise
The machine learning models and the deep learning model received comparable hyperparameter optimization effort and the chosen feature representations do not systematically favor the simpler algorithms.
What would settle it
A retraining of the bidirectional long short-term memory model with attention using a wider hyperparameter search or different text embeddings that raises its accuracy above 97.26 percent on the same held-out test samples would falsify the marginal outperformance claim.
Original abstract
Sentiment analysis of product reviews on e-commerce platforms plays a critical role in automatically understanding customer satisfaction and providing actionable insights for sellers seeking to improve product quality. This paper presents a comprehensive benchmarking study comparing a Machine Learning (ML) approach via the PyCaret AutoML framework against a Deep Learning (DL) approach based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture with an Attention mechanism for binary sentiment classification on Indonesian product reviews. The dataset comprises 19,728 samples balanced equally between positive and negative reviews. For the ML approach, three prominent algorithms were evaluated via 10-fold stratified cross-validation: Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, and Light Gradient Boosting Machine (LightGBM). Logistic Regression achieved the best ML performance with an accuracy of 97.26% and an F1-score of 97.26%. The BiLSTM with Attention model, evaluated on 3,946 held-out test samples, achieved an accuracy of 97.24% and an F1-score of 97.24%. These comparative results demonstrate that traditional ML algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential DL architectures on high-dimensional datasets, while simultaneously offering greater computational efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks three PyCaret-tuned traditional ML models (Logistic Regression, linear SVM, LightGBM) against a BiLSTM-with-attention deep model for binary sentiment classification on a balanced set of 19,728 Indonesian product reviews. Using 10-fold stratified cross-validation, Logistic Regression reaches 97.26% accuracy and F1; the BiLSTM, evaluated on a 3,946-sample held-out test set, reaches 97.24%. The authors conclude that properly preprocessed traditional ML can match or slightly exceed complex sequential DL while being computationally cheaper.
Significance. If the hyperparameter-search effort and feature representations are shown to be comparable, the result would usefully demonstrate that AutoML-tuned linear and tree-based models remain competitive with attention-based recurrent networks on high-dimensional text data, reinforcing the value of efficient baselines before scaling to DL. The concrete accuracy/F1 numbers obtained under stratified CV and a held-out test set provide a clear, falsifiable empirical anchor.
major comments (2)
- [BiLSTM experimental configuration and hyperparameter tuning] The central claim that traditional ML 'can compete closely with, and even marginally outperform, more complex sequential DL architectures' rests on the assumption of equivalent modeling effort. The manuscript describes PyCaret's automated grid/random search over regularization, solver, class_weight, n_estimators, max_depth, learning_rate, etc., plus automatic feature engineering for LR/SVM/LightGBM, but provides no corresponding search budget, ranges, or early-stopping protocol for the BiLSTM (embedding dim, hidden size, attention heads, dropout, optimizer schedule, epochs). Without this, the 0.02-percentage-point gap cannot be interpreted as evidence that the architectures are comparable under equal optimization.
- [Evaluation methodology and results reporting] Evaluation protocols are inconsistent: ML results are reported from 10-fold stratified cross-validation (with no per-fold variance or standard deviation supplied), while the BiLSTM result is from a single held-out test set. This asymmetry makes it impossible to assess whether the reported 97.26% vs 97.24% difference is statistically meaningful or sensitive to the particular test split.
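The per-fold variance requested in the second comment could be produced along the following lines. This is a minimal sketch assuming a simple round-robin stratified split; `stratified_folds` and `summarize` are hypothetical helper names, not PyCaret functions, and no scores from the paper are reproduced here.

```python
import random
import statistics


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios equal."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share per class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Balanced binary labels matching the dataset size in the abstract (19,728 reviews).
labels = [0] * 9864 + [1] * 9864
folds = stratified_folds(labels, k=10)
fold_sizes = [len(fold) for fold in folds]
```

Each fold's accuracy or F1-score would be appended to a list and passed to `summarize`, giving the mean and standard deviation across the 10 folds that the report asks for.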
minor comments (2)
- [ML preprocessing pipeline] Tokenization, vocabulary size, and exact preprocessing steps (stop-word removal, n-gram range, TF-IDF vs count vectorizer) for the ML pipeline are not stated, preventing exact reproduction of the 97.26% LR figure.
- [Model architecture description] The BiLSTM architecture diagram or layer dimensions (embedding size, LSTM units, attention mechanism details) are missing; only the high-level description is given.
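For orientation, one common form of attention over recurrent outputs is additive (Bahdanau-style) pooling, sketched below. The paper does not specify its attention mechanism, so this single-head form, the function name `attention_pool`, and the toy matrices are assumptions rather than the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, W, v):
    """Additive attention pooling: score each time step, then average the states.

    hidden_states: list of T vectors of length d (e.g. BiLSTM outputs).
    W: a-by-d projection matrix given as a list of rows; v: scoring vector of length a.
    Returns (weights, context), where context is the weighted sum of the states.
    """
    scores = []
    for h in hidden_states:
        projected = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, projected)))
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(a * h[j] for a, h in zip(weights, hidden_states)) for j in range(dim)]
    return weights, context


# Toy input: three time steps with 2-dimensional states (values are arbitrary).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, chosen only for illustration
v = [1.0, 1.0]
weights, context = attention_pool(states, W, v)
```

Reporting the actual dimensions of the projection and the number of heads, as the comment requests, would let readers replace these placeholders with the authors' values.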
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have prompted us to improve the transparency and rigor of our experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [BiLSTM experimental configuration and hyperparameter tuning] The central claim that traditional ML 'can compete closely with, and even marginally outperform, more complex sequential DL architectures' rests on the assumption of equivalent modeling effort. The manuscript describes PyCaret's automated grid/random search over regularization, solver, class_weight, n_estimators, max_depth, learning_rate, etc., plus automatic feature engineering for LR/SVM/LightGBM, but provides no corresponding search budget, ranges, or early-stopping protocol for the BiLSTM (embedding dim, hidden size, attention heads, dropout, optimizer schedule, epochs). Without this, the 0.02-percentage-point gap cannot be interpreted as evidence that the architectures are comparable under equal optimization.
Authors: We acknowledge that the original manuscript did not sufficiently detail the BiLSTM hyperparameter optimization process, which limits the ability to directly compare modeling effort. In our experiments, the BiLSTM was tuned via a combination of grid search and validation-based selection over embedding dimensions (100, 200), hidden sizes (128, 256), attention heads (1-2), and dropout (0.2-0.5), using the Adam optimizer with a learning rate of 0.001 and early stopping on validation loss (patience of 5 epochs). To address this gap transparently, we will add a dedicated subsection in the revised Methods and Experiments sections describing the full search space, number of configurations evaluated, and stopping criteria. This will allow readers to assess the comparability of optimization effort while preserving our core observation that well-tuned traditional ML models can achieve competitive performance with lower computational cost.
revision: yes
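The search space and stopping rule described in this response can be written down as a short sketch. The embedding, hidden-size, and head values come from the rebuttal; the dropout grid is an assumption, since only the range 0.2-0.5 is given, and `enumerate_configs` and `early_stopping_epoch` are hypothetical helper names. The resulting configuration count is therefore illustrative, not a figure reported by the authors.

```python
from itertools import product

# Search space as listed in the rebuttal. The dropout values are an ASSUMPTION:
# the rebuttal gives only the range 0.2-0.5, not the grid step.
SEARCH_SPACE = {
    "embedding_dim": [100, 200],
    "hidden_size": [128, 256],
    "attention_heads": [1, 2],
    "dropout": [0.2, 0.3, 0.4, 0.5],  # assumed 0.1 step
}


def enumerate_configs(space):
    """Yield one dict per point of the full Cartesian grid."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops.

    Training stops once validation loss has failed to improve for `patience`
    consecutive epochs; if that never happens, all epochs are run.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


configs = list(enumerate_configs(SEARCH_SPACE))
```

Under the assumed dropout step the grid has 2 x 2 x 2 x 4 = 32 points. Stating the real count and grid in the revised manuscript would let readers compare this budget with the PyCaret search.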
- Referee: [Evaluation methodology and results reporting] Evaluation protocols are inconsistent: ML results are reported from 10-fold stratified cross-validation (with no per-fold variance or standard deviation supplied), while the BiLSTM result is from a single held-out test set. This asymmetry makes it impossible to assess whether the reported 97.26% vs 97.24% difference is statistically meaningful or sensitive to the particular test split.
Authors: We agree that the lack of variance reporting for the 10-fold CV and the single-run evaluation for BiLSTM creates an asymmetry that hinders interpretation of the small observed difference. The ML models used stratified 10-fold CV due to their computational efficiency, while the BiLSTM used a fixed 20% held-out test set (3,946 samples) to manage training time. In the revised manuscript, we will update the Results section to report mean accuracy and F1-score plus standard deviation across the 10 folds for all ML models. We will also add a discussion paragraph acknowledging the single-run limitation for the BiLSTM and noting that the 0.02-percentage-point difference falls well within typical cross-split variability at these high performance levels. This revision will enable better assessment of robustness without requiring full re-execution of the BiLSTM under CV.
revision: partial
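A back-of-the-envelope check gives a sense of scale for the variability mentioned in this response. The sketch below uses only figures from the abstract and a normal approximation to the binomial. It assumes independent test samples and compares a cross-validation mean with a single held-out split, so it is a rough guide rather than a significance test.

```python
import math

N_TOTAL = 19_728        # dataset size from the abstract
N_TEST = 3_946          # held-out test samples from the abstract
ACC_BILSTM = 0.9724     # reported BiLSTM accuracy (held-out test set)
ACC_LR = 0.9726         # reported Logistic Regression accuracy (10-fold CV)


def binomial_standard_error(accuracy, n):
    """Standard error of an accuracy estimate from n independent test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def normal_interval(accuracy, n, z=1.96):
    """Approximate 95% confidence interval using the normal approximation."""
    half_width = z * binomial_standard_error(accuracy, n)
    return accuracy - half_width, accuracy + half_width


# A 20% split of the dataset, rounded up, reproduces the stated test-set size.
test_size = math.ceil(0.20 * N_TOTAL)

gap = ACC_LR - ACC_BILSTM            # about 0.0002, i.e. 0.02 percentage points
gap_in_samples = gap * N_TEST        # fewer than one review on the test set
low, high = normal_interval(ACC_BILSTM, N_TEST)
```

Under these assumptions the interval around the BiLSTM accuracy spans roughly half a percentage point in each direction, while the reported gap corresponds to less than one test review, so the Logistic Regression figure sits comfortably inside it.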
Circularity Check
No circularity: purely empirical benchmarking with direct performance measurements
full rationale
The paper reports measured accuracy and F1 scores from 10-fold stratified cross-validation on the ML models (via PyCaret) and a held-out test set for the BiLSTM+attention model. No derivations, equations, fitted parameters presented as predictions, or self-referential steps exist. The central claim rests on direct experimental outcomes on the 19,728-sample dataset splits rather than any reduction to inputs by construction. This is self-contained empirical work with no load-bearing self-citations or ansatzes.
Axiom & Free-Parameter Ledger
free parameters (1)
- Hyperparameters for LR, SVM, LightGBM, and BiLSTM
axioms (1)
- domain assumption: The 19,728-review dataset is representative of general Indonesian product-review sentiment.