Benchmarking Logistic Regression, SVM, and LightGBM Against BiLSTM with Attention for Sentiment Analysis on Indonesian Product Reviews

Ardika Satria; Hanna Gresia Sinaga; Ivana Margareth Hutabarat; Luluk Muthoharoh; Martin C.T. Manullang; Razin Hafid Hamdi

arxiv: 2604.25452 · v1 · submitted 2026-04-28 · 💻 cs.CL

Razin Hafid Hamdi, Ivana Margareth Hutabarat, Hanna Gresia Sinaga, Luluk Muthoharoh, Ardika Satria, Martin C.T. Manullang

Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords sentiment analysis · Indonesian product reviews · machine learning · deep learning · benchmarking · logistic regression · bidirectional LSTM

The pith

Traditional machine learning classifiers can marginally outperform deep learning models for sentiment analysis on Indonesian product reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks three machine learning algorithms against a bidirectional long short-term memory network with attention for binary classification of positive and negative sentiment in Indonesian e-commerce reviews. On a balanced dataset of 19,728 samples, logistic regression with automated tuning and feature extraction reaches 97.26 percent accuracy and F1-score under 10-fold stratified cross-validation, compared with 97.24 percent for the deep learning model on 3,946 held-out test samples. This matters because it suggests that simpler models can deliver equivalent or marginally better results on high-dimensional text data when preprocessing is thorough. The practical implication is that complex sequential architectures are not always necessary for sentiment tasks, which lowers computational cost in deployment.

Core claim

Logistic regression achieves the best performance among the evaluated machine learning approaches, with 97.26 percent accuracy and F1-score under 10-fold stratified cross-validation. This slightly exceeds the bidirectional long short-term memory model with attention, which reaches 97.24 percent accuracy and F1-score on the held-out test portion of the Indonesian product review data. The work argues that traditional machine learning algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential deep learning architectures on high-dimensional datasets, while also offering greater computational efficiency.

What carries the argument

The direct side-by-side comparison of feature-engineered machine learning classifiers against an attention-augmented recurrent neural network for binary sentiment classification on product review text.

If this is right

For similar high-dimensional text classification tasks, simpler machine learning models can be selected to achieve high accuracy without the training overhead of deep networks.
E-commerce platforms can implement sentiment monitoring with lower resource demands while maintaining near-equivalent performance levels.
Thorough preprocessing and automated tuning enable conventional classifiers to remain viable alternatives to sequential neural architectures in production settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The narrow performance difference suggests that attention layers may contribute little when input features already capture strong signals from bag-of-words or similar representations.
Repeating the benchmark on unbalanced review data or in other languages could test whether the preference for machine learning models generalizes beyond the balanced Indonesian binary case.
This style of controlled comparison offers a template for deciding model families in other natural language processing applications where efficiency matters.

Load-bearing premise

The machine learning models and the deep learning model received comparable hyperparameter optimization effort and the chosen feature representations do not systematically favor the simpler algorithms.

What would settle it

A retraining of the bidirectional long short-term memory model with attention using a wider hyperparameter search or different text embeddings that raises its accuracy above 97.26 percent on the same held-out test samples would falsify the marginal outperformance claim.
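To gauge how small that margin is, the reported percentages can be converted into counts on the 3,946-sample test set. The sketch below is plain arithmetic on the figures quoted above. The function and variable names are illustrative, and it assumes the reported 97.24 percent is a rounded fraction of correctly classified test samples.

```python
import math

TEST_SIZE = 3946     # held-out test samples, as reported in the abstract
BILSTM_ACC = 0.9724  # reported BiLSTM+Attention accuracy (rounded)
LR_ACC = 0.9726      # reported logistic regression accuracy (rounded)


def correct_needed_to_exceed(target_acc: float, n: int) -> int:
    """Smallest count of correct predictions whose accuracy is strictly above target_acc."""
    return math.floor(target_acc * n) + 1


# Assumption: the reported accuracy equals correct_count / TEST_SIZE, rounded.
bilstm_correct = round(BILSTM_ACC * TEST_SIZE)
needed_correct = correct_needed_to_exceed(LR_ACC, TEST_SIZE)
extra_correct = needed_correct - bilstm_correct

print(f"Implied correct BiLSTM predictions: {bilstm_correct} of {TEST_SIZE}")
print(f"Correct predictions needed to exceed {LR_ACC:.2%}: {needed_correct}")
print(f"Additional correct predictions required: {extra_correct}")
```

Under these assumptions, a single additional correct test prediction would be enough to move the BiLSTM past 97.26 percent, which is why the falsification test above is a low bar.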

Figures

Figures reproduced from arXiv: 2604.25452 by Ardika Satria, Hanna Gresia Sinaga, Ivana Margareth Hutabarat, Luluk Muthoharoh, Martin C.T. Manullang, Razin Hafid Hamdi.

**Figure 1.** Training and validation loss (left) and accuracy (right) curves for the best BiLSTM+Attention model.

**Figure 2.** Optuna hyperparameter optimization results.

Original abstract

Sentiment analysis of product reviews on e-commerce platforms plays a critical role in automatically understanding customer satisfaction and providing actionable insights for sellers seeking to improve product quality. This paper presents a comprehensive benchmarking study comparing a Machine Learning (ML) approach via the PyCaret AutoML framework against a Deep Learning (DL) approach based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture with an Attention mechanism for binary sentiment classification on Indonesian product reviews. The dataset comprises 19,728 samples balanced equally between positive and negative reviews. For the ML approach, three prominent algorithms were evaluated via 10-fold stratified cross-validation: Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, and Light Gradient Boosting Machine (LightGBM). Logistic Regression achieved the best ML performance with an accuracy of 97.26% and an F1-score of 97.26%. The BiLSTM with Attention model, evaluated on 3,946 held-out test samples, achieved an accuracy of 97.24% and an F1-score of 97.24%. These comparative results demonstrate that traditional ML algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential DL architectures on high-dimensional datasets, while simultaneously offering greater computational efficiency.
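As an illustration of what 10-fold stratified cross-validation means for a dataset of this shape, the following sketch assigns fold ids so that each class is spread evenly across folds. It is a simplified stand-in, not PyCaret's implementation: the function name is hypothetical, the labels are synthetic, and the 9,864 samples per class are inferred from the stated equal balance of 19,728 reviews.

```python
from collections import Counter, defaultdict


def stratified_fold_ids(labels, n_folds=10):
    """Assign each sample a fold id, cycling through folds separately per class."""
    next_slot = defaultdict(int)  # per-class counter of samples seen so far
    fold_ids = []
    for y in labels:
        fold_ids.append(next_slot[y] % n_folds)
        next_slot[y] += 1
    return fold_ids


N_TOTAL = 19728  # dataset size from the abstract
# Synthetic labels: 0 = negative, 1 = positive, equally balanced (assumption).
labels = [0] * (N_TOTAL // 2) + [1] * (N_TOTAL // 2)

fold_ids = stratified_fold_ids(labels, n_folds=10)
fold_sizes = Counter(fold_ids)
per_fold_class = Counter(zip(fold_ids, labels))

for k in range(10):
    print(k, fold_sizes[k], per_fold_class[(k, 0)], per_fold_class[(k, 1)])
```

Each fold ends up with the same number of positive and negative samples. A real pipeline would typically shuffle within each class before assignment; that step is omitted here to keep the sketch deterministic.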

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows PyCaret-tuned logistic regression tying a BiLSTM with attention at ~97.25% on this Indonesian review set, but the comparison likely favors the ML side due to heavier automated tuning.

The main thing to know is that logistic regression via PyCaret hits 97.26% accuracy and F1 on the balanced 19k Indonesian product reviews, matching the BiLSTM with attention at 97.24% on the held-out test set. The work runs proper 10-fold stratified cross-validation for the ML models and reports the numbers cleanly. That part is straightforward and useful as a baseline for this language and domain.

The paper does a decent job documenting the dataset balance and the close performance, which supports the practical point that simpler models can deliver comparable results with less compute once preprocessing and feature work are handled well. The efficiency angle is also noted without overclaiming.

The soft spot is the tuning imbalance. PyCaret automates extensive hyperparameter search and feature engineering across LR, SVM, and LightGBM, while the BiLSTM description gives no equivalent detail on search budget, embedding dimensions, hidden size, attention setup, dropout, or optimizer choices. Without that, the near-tie does not strongly back the claim that traditional ML competes with properly optimized sequential DL. Reproducibility also suffers from missing tokenization and exact preprocessing steps.

This is incremental work rather than a new method, but the concrete numbers on a non-English dataset make it worth having in the literature. It is the sort of paper a practitioner building sentiment tools for e-commerce in Indonesian or similar languages would find handy for quick comparisons. It deserves peer review because the empirical setup is solid enough to evaluate once the tuning and pipeline details are filled in.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks three PyCaret-tuned traditional ML models (Logistic Regression, linear SVM, LightGBM) against a BiLSTM-with-attention deep model for binary sentiment classification on a balanced set of 19,728 Indonesian product reviews. Using 10-fold stratified cross-validation, Logistic Regression reaches 97.26% accuracy and F1; the BiLSTM, evaluated on a 3,946-sample held-out test set, reaches 97.24%. The authors conclude that properly preprocessed traditional ML can match or slightly exceed complex sequential DL while being computationally cheaper.

Significance. If the hyperparameter-search effort and feature representations are shown to be comparable, the result would usefully demonstrate that AutoML-tuned linear and tree-based models remain competitive with attention-based recurrent networks on high-dimensional text data, reinforcing the value of efficient baselines before scaling to DL. The concrete accuracy/F1 numbers obtained under stratified CV and a held-out test set provide a clear, falsifiable empirical anchor.

major comments (2)

[BiLSTM experimental configuration and hyperparameter tuning] The central claim that traditional ML 'can compete closely with, and even marginally outperform, more complex sequential DL architectures' rests on the assumption of equivalent modeling effort. The manuscript describes PyCaret's automated grid/random search over regularization, solver, class_weight, n_estimators, max_depth, learning_rate, etc., plus automatic feature engineering for LR/SVM/LightGBM, but provides no corresponding search budget, ranges, or early-stopping protocol for the BiLSTM (embedding dim, hidden size, attention heads, dropout, optimizer schedule, epochs). Without this, the 0.02 percentage-point gap cannot be interpreted as evidence that the architectures are comparable under equal optimization.
[Evaluation methodology and results reporting] Evaluation protocols are inconsistent: ML results are reported from 10-fold stratified cross-validation (with no per-fold variance or standard deviation supplied), while the BiLSTM result is from a single held-out test set. This asymmetry makes it impossible to assess whether the reported 97.26% vs 97.24% difference is statistically meaningful or sensitive to the particular test split.
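The second major comment can be made concrete with a rough sampling-error estimate. The sketch below uses the normal approximation to a binomial proportion to compute the standard error of an accuracy measured on 3,946 test samples. It assumes independent test samples and is only indicative, since the logistic regression figure comes from cross-validation rather than from the same split.

```python
import math


def accuracy_standard_error(acc: float, n: int) -> float:
    """Standard error of an accuracy estimate under the binomial model."""
    return math.sqrt(acc * (1.0 - acc) / n)


n_test = 3946        # held-out test samples reported in the paper
bilstm_acc = 0.9724  # reported BiLSTM+Attention accuracy
lr_acc = 0.9726      # reported logistic regression accuracy (10-fold CV)

se = accuracy_standard_error(bilstm_acc, n_test)
half_width_95 = 1.96 * se  # approximate 95% interval half-width
gap = lr_acc - bilstm_acc

print(f"standard error:       {100 * se:.2f} percentage points")
print(f"95% half-width:       {100 * half_width_95:.2f} percentage points")
print(f"reported gap:         {100 * gap:.2f} percentage points")
print(f"gap / standard error: {gap / se:.2f}")
```

The standard error comes out at roughly 0.26 percentage points, more than ten times the reported 0.02 percentage-point gap, which supports the referee's point that the difference cannot be read as meaningful without variance estimates.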

minor comments (2)

[ML preprocessing pipeline] Tokenization, vocabulary size, and exact preprocessing steps (stop-word removal, n-gram range, TF-IDF vs count vectorizer) for the ML pipeline are not stated, preventing exact reproduction of the 97.26% LR figure.
[Model architecture description] The BiLSTM architecture diagram or layer dimensions (embedding size, LSTM units, attention mechanism details) are missing; only the high-level description is given.
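To show the kind of detail the second minor comment asks for, here is a minimal pure-Python sketch of one common attention design: additive (Bahdanau-style) scoring used to pool a sequence of BiLSTM hidden states into a single context vector. This is an assumption for illustration only. The paper's actual attention variant, dimensions, and parameter names are not given, and `W`, `b`, and `v` below are hypothetical learned parameters filled with toy values.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention_pool(
    hidden_states: Matrix, W: Matrix, b: Vector, v: Vector
) -> Tuple[Vector, Vector]:
    """Score each time step with v . tanh(W h + b), then return
    (attention weights, weighted sum of hidden states)."""
    scores = []
    for h in hidden_states:
        proj = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, proj)))
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)
    ]
    return weights, context


# Toy example: 3 time steps, hidden size 2, attention size 2 (all values made up).
hidden_states = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5]]
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.0]
v = [1.0, -1.0]

weights, context = additive_attention_pool(hidden_states, W, b, v)
print("attention weights:", weights)
print("context vector:   ", context)
```

Reporting the sizes of `W` and `v`, the hidden size, and whether the pooled vector feeds a dense layer directly would be enough for a reader to reproduce this part of the model.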

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have prompted us to improve the transparency and rigor of our experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [BiLSTM experimental configuration and hyperparameter tuning] The central claim that traditional ML 'can compete closely with, and even marginally outperform, more complex sequential DL architectures' rests on the assumption of equivalent modeling effort. The manuscript describes PyCaret's automated grid/random search over regularization, solver, class_weight, n_estimators, max_depth, learning_rate, etc., plus automatic feature engineering for LR/SVM/LightGBM, but provides no corresponding search budget, ranges, or early-stopping protocol for the BiLSTM (embedding dim, hidden size, attention heads, dropout, optimizer schedule, epochs). Without this, the 0.02 percentage-point gap cannot be interpreted as evidence that the architectures are comparable under equal optimization.

Authors: We acknowledge that the original manuscript did not sufficiently detail the BiLSTM hyperparameter optimization process, which limits the ability to directly compare modeling effort. In our experiments, the BiLSTM was tuned via a combination of grid search and validation-based selection over embedding dimensions (100, 200), hidden sizes (128, 256), attention heads (1-2), dropout (0.2-0.5), and Adam optimizer with learning rate 0.001 and early stopping on validation loss (patience of 5 epochs). To address this gap transparently, we will add a dedicated subsection in the revised Methods and Experiments sections describing the full search space, number of configurations evaluated, and stopping criteria. This will allow readers to assess the comparability of optimization effort while preserving our core observation that well-tuned traditional ML models can achieve competitive performance with lower computational cost. revision: yes
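The search space described in this response can be enumerated to see how many configurations a full grid would involve. The sketch below is illustrative: the response gives dropout and attention heads only as ranges, so the discrete dropout steps of 0.1 are an assumption, and the key names are hypothetical. The number of configurations actually evaluated is not stated in the source.

```python
from itertools import product

# Tunable values taken from the rebuttal text; dropout grid steps are assumed.
search_space = {
    "embedding_dim": [100, 200],
    "hidden_size": [128, 256],
    "attention_heads": [1, 2],
    "dropout": [0.2, 0.3, 0.4, 0.5],  # assumption: 0.1 steps over 0.2-0.5
}

# Settings the rebuttal describes as fixed.
fixed_settings = {
    "optimizer": "adam",
    "learning_rate": 0.001,
    "early_stopping_patience": 5,
}


def enumerate_configs(space):
    """Return every combination in the grid as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = enumerate_configs(search_space)
print(f"{len(configs)} grid configurations")
print("first:", {**configs[0], **fixed_settings})
```

Under the assumed dropout grid this yields 32 configurations, a concrete figure that could be set against the PyCaret search budget when comparing optimization effort.
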
Referee: [Evaluation methodology and results reporting] Evaluation protocols are inconsistent: ML results are reported from 10-fold stratified cross-validation (with no per-fold variance or standard deviation supplied), while the BiLSTM result is from a single held-out test set. This asymmetry makes it impossible to assess whether the reported 97.26% vs 97.24% difference is statistically meaningful or sensitive to the particular test split.

Authors: We agree that the lack of variance reporting for the 10-fold CV and the single-run evaluation for BiLSTM creates an asymmetry that hinders interpretation of the small observed difference. The ML models used stratified 10-fold CV due to their computational efficiency, while the BiLSTM used a fixed 20% held-out test set (3,946 samples) to manage training time. In the revised manuscript, we will update the Results section to report mean accuracy and F1-score plus standard deviation across the 10 folds for all ML models. We will also add a discussion paragraph acknowledging the single-run limitation for the BiLSTM and noting that the 0.02 percentage-point difference falls well within typical cross-split variability at these high performance levels. This revision will enable better assessment of robustness without requiring full re-execution of the BiLSTM under CV. revision: partial
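The promised mean and standard deviation across folds could be produced along the following lines. This is a sketch using Python's standard library. The per-fold accuracies are hypothetical placeholder values chosen only to exercise the function and are not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_folds(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-fold scores."""
    if len(fold_scores) < 2:
        raise ValueError("need at least two folds to estimate a standard deviation")
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Hypothetical placeholder values for 10 folds, NOT taken from the paper.
example_fold_accuracies = [0.90, 0.92, 0.91, 0.93, 0.89, 0.91, 0.90, 0.92, 0.91, 0.91]

mean_acc, std_acc = summarize_folds(example_fold_accuracies)
print(f"accuracy: {100 * mean_acc:.2f}% +/- {100 * std_acc:.2f} (10 folds)")
```

Reporting the sample standard deviation (n - 1 in the denominator) is one reasonable choice; whichever convention is used should be stated alongside the numbers.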

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct performance measurements

The paper reports measured accuracy and F1 scores from 10-fold stratified cross-validation on the ML models (via PyCaret) and a held-out test set for the BiLSTM+attention model. No derivations, equations, fitted parameters presented as predictions, or self-referential steps exist. The central claim rests on direct experimental outcomes on the 19,728-sample dataset splits rather than any reduction to inputs by construction. This is self-contained empirical work with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on standard supervised learning assumptions plus the unstated premise that the automated hyperparameter search in PyCaret and the manual BiLSTM configuration are comparable.

free parameters (1)

Hyperparameters for LR, SVM, LightGBM, and BiLSTM
Performance depends on values chosen or tuned by PyCaret and the authors' BiLSTM training procedure.

axioms (1)

domain assumption: The 19,728-review dataset is representative of general Indonesian product-review sentiment.
Generalization from this corpus to other Indonesian text is assumed without further validation.

Reference graph

Works this paper leans on

1 extracted reference · 1 canonical work page

[1] PyCaret: An open source, low-code machine learning library in Python. https://pycaret.org, 2020

Moez Ali. PyCaret: An open source, low-code machine learning library in Python. https://pycaret.org, 2020. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), 2015. Imam Riadi, Rusydi Umar, and Ridho Nasrulloh. Sent...
