arxiv: 2605.01322 · v1 · submitted 2026-05-02 · 💻 cs.CL

Benchmarking LightGBM and BiLSTM for Sentiment Analysis on Indonesian E-Commerce Reviews

Pith reviewed 2026-05-09 14:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentiment analysis · BiLSTM · LightGBM · Indonesian language · e-commerce reviews · machine learning · deep learning · natural language processing

The pith

BiLSTM achieves 98.87 percent accuracy on sentiment analysis of Indonesian e-commerce reviews, outperforming LightGBM and other ML models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares machine learning models trained via the PyCaret framework, specifically LightGBM, Logistic Regression, and SVM, against a Bidirectional LSTM deep learning model. The comparison uses a 15,000-sample Indonesian e-commerce review dataset from Hugging Face, split into training, validation, and test portions. BiLSTM reaches 98.87 percent accuracy and F1-score while LightGBM reaches 98.23 percent with fast training. A sympathetic reader would care because the result shows deep learning can better handle sequential context in non-English review text, helping automate customer feedback analysis for e-commerce platforms.

Core claim

The BiLSTM model outperforms all ML models, achieving an accuracy of 98.87% and an F1-Score of 98.87%. Meanwhile, LightGBM emerges as the best-performing ML model with an accuracy of 98.23% in a highly efficient training time. This research proves that the BiLSTM architecture is highly capable of capturing the sequential context of Indonesian review texts, making it the superior model for this specific classification task.

What carries the argument

BiLSTM architecture, which processes text sequences bidirectionally to capture forward and backward context for improved sentiment classification over standard ML models.
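To make the bidirectional idea concrete, here is a minimal sketch in plain Python. It is not the paper's model: a scalar tanh recurrence stands in for the LSTM cell, and the weights (`w_in`, `w_rec`) and the per-token input values are hypothetical. The point it illustrates is only that the backward pass lets the representation of an early token depend on words that come later in the review.

```python
import math

def run_direction(inputs, w_in=0.8, w_rec=0.5):
    """Toy tanh recurrence (a stand-in for an LSTM cell, not the paper's model)."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional_states(inputs):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = run_direction(inputs)
    # Run over the reversed sequence, then flip back so positions line up.
    backward = run_direction(inputs[::-1])[::-1]
    return list(zip(forward, backward))

# Hypothetical per-token inputs for two short reviews; only the last token differs.
review_a = [0.2, -0.1, 0.9]
review_b = [0.2, -0.1, -0.9]

states_a = bidirectional_states(review_a)
states_b = bidirectional_states(review_b)

# Position 0: the forward state is identical, the backward state is not,
# because only the backward pass has "seen" the final token.
print(states_a[0], states_b[0])
```

A unidirectional model would assign the first token the same state in both reviews. The bidirectional pairing is what allows a late negation or qualifier to change how earlier words are represented.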

If this is right

  • BiLSTM should be considered first for sentiment tasks on Indonesian text when high accuracy is the priority.
  • LightGBM remains a practical choice for similar classification when training speed and compute resources matter more than the extra accuracy gain.
  • The results indicate that bidirectional sequence modeling adds measurable value for review texts that contain context-dependent phrasing.
  • Standard splits on this dataset size allow reproducible benchmarking between ML and DL approaches for this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The performance gap may widen or shrink when the same models are applied to other low-resource languages with similar review structures.
  • Adding Indonesian-specific tokenization or morphological preprocessing could further improve BiLSTM results beyond the reported numbers.
  • E-commerce platforms operating in Indonesia could integrate BiLSTM-based sentiment scoring to prioritize responses to negative reviews more effectively.

Load-bearing premise

The Hugging Face Indonesian e-commerce review dataset is representative, balanced, correctly labeled, and free of leakage, with standard train-validation-test splits being sufficient without extra Indonesian-specific preprocessing.
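One part of this premise, a split that preserves the label distribution, can be sketched in a few lines. The 80/10/10 ratios, the seed, and the 70/30 label mix below are all assumptions for illustration. The material above states neither the actual split ratios nor the class balance.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return train/val/test index lists with per-label proportions preserved.

    The 80/10/10 default is an assumption; the paper's split is not stated.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_val = int(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical label column: 15,000 reviews with an assumed 70/30 class mix.
labels = ["positive"] * 10500 + ["negative"] * 4500
train, val, test = stratified_split(labels)
test_counts = Counter(labels[i] for i in test)
print(len(train), len(val), len(test), dict(test_counts))
```

Fixing the seed and splitting per label makes the partition reproducible and keeps each subset's class mix equal to the full dataset's, which is the property the premise takes for granted.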

What would settle it

Retrain both BiLSTM and LightGBM on a new, independently collected and labeled collection of Indonesian e-commerce reviews and check whether BiLSTM still exceeds 98 percent accuracy while outperforming the ML baseline.
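A check of this kind could be scored with a paired bootstrap over the new test set, sketched below under stated assumptions. The test-set size of 1,500 and the per-review correctness flags are invented for illustration (17 BiLSTM errors and 27 LightGBM errors roughly match 98.87% and 98.23% at that size). The paper does not report per-example predictions.

```python
import random

def paired_bootstrap_diff(correct_a, correct_b, n_resamples=1000, seed=0):
    """95% percentile interval for accuracy(A) - accuracy(B) on shared test items."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-example contribution to the accuracy difference: +1, 0, or -1.
    per_item = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    # Resample test items with replacement, keeping each pair together.
    diffs = sorted(
        sum(rng.choices(per_item, k=n)) / n for _ in range(n_resamples)
    )
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return low, high

# Hypothetical correctness flags on an assumed 1,500-review test set.
n_test = 1500
bilstm_wrong = set(range(0, 17))   # 17 errors, invented
lgbm_wrong = set(range(5, 32))     # 27 errors, 12 shared with BiLSTM, invented
correct_bilstm = [i not in bilstm_wrong for i in range(n_test)]
correct_lgbm = [i not in lgbm_wrong for i in range(n_test)]

observed = sum(correct_bilstm) / n_test - sum(correct_lgbm) / n_test
low, high = paired_bootstrap_diff(correct_bilstm, correct_lgbm)
print(f"observed gap {observed:.4f}, 95% interval [{low:.4f}, {high:.4f}]")
```

Under this reading, the claim would survive replication if BiLSTM accuracy on the new data stays above 0.98 and the interval for the gap excludes zero.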

Figures

Figures reproduced from arXiv: 2605.01322 by Ardika Satria, Iqfina Haula Halika, Lidia Natasyah Marpaung, Luluk Muthoharoh, Martin Clinton Tosima Manullang, Vania Claresta.

Figure 1. Label distribution of the e-commerce sentiment dataset.
Figure 2. Training and validation curves (Loss and Accuracy) for the BiLSTM model.
Figure 3. Confusion Matrix for the BiLSTM Model.

Original abstract

This study presents a comparative analysis between two primary approaches in Natural Language Processing (NLP): Machine Learning (ML) utilizing the PyCaret AutoML framework, and Deep Learning (DL). The evaluation is conducted on a sentiment analysis task using an Indonesian e-commerce review dataset sourced from Hugging Face. The dataset, consisting of 15,000 samples, is partitioned into training, validation, and testing sets. The ML experiments compare LightGBM, Logistic Regression, and Support Vector Machine (SVM) algorithms, whereas the DL experiment implements a Bidirectional Long Short-Term Memory (BiLSTM) architecture. The experimental results demonstrate that the BiLSTM model outperforms all ML models, achieving an accuracy of 98.87% and an F1-Score of 98.87%. Meanwhile, LightGBM emerges as the best-performing ML model with an accuracy of 98.23% in a highly efficient training time. This research proves that the BiLSTM architecture is highly capable of capturing the sequential context of Indonesian review texts, making it the superior model for this specific classification task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a benchmark comparing machine learning models (LightGBM, Logistic Regression, SVM via PyCaret) against a BiLSTM deep learning model for sentiment analysis on a 15,000-sample Indonesian e-commerce review dataset from Hugging Face. It reports that BiLSTM achieves the highest performance with 98.87% accuracy and F1-score, outperforming the best ML model LightGBM at 98.23%, and attributes this to BiLSTM's ability to capture sequential context in Indonesian texts.

Significance. If the reported performance gap holds under rigorous validation, the work would provide modest evidence that BiLSTM offers a small but consistent edge over efficient gradient-boosted trees for this task, which could inform practical model selection in Indonesian NLP applications where training time matters. However, the absolute accuracies are high enough to suggest the dataset may contain many easy or repetitive examples, so the result's broader significance for sequential modeling advantages in low-resource languages remains limited without controls for data artifacts.
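One simple control for the "easy or repetitive examples" concern is to count test reviews that also appear, up to trivial formatting, in the training set. The sketch below is an assumption about how such a check could look. The normalization rule and the example reviews are hypothetical and not drawn from the dataset.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants match."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def overlap_report(train_texts, test_texts):
    """Count test reviews whose normalized form also occurs in the training set."""
    seen = {normalize(t) for t in train_texts}
    leaked = [t for t in test_texts if normalize(t) in seen]
    return len(leaked), len(leaked) / len(test_texts)

# Hypothetical reviews, invented for illustration.
train_texts = ["Barang bagus, pengiriman cepat!", "Kualitas buruk, kecewa."]
test_texts = ["barang bagus pengiriman cepat", "Produk sesuai deskripsi."]

count, rate = overlap_report(train_texts, test_texts)
print(count, rate)
```

A high overlap rate would mean part of the reported accuracy reflects memorized text rather than generalization. Exact matching after normalization is a lower bound, since paraphrased near-duplicates would need a fuzzier comparison.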

major comments (3)
  1. [Abstract] Abstract: The superiority claim for BiLSTM rests on a 0.64-point accuracy gap over LightGBM, yet the text supplies no train/validation/test split ratios, no class distribution or stratification details, and no statistical significance test (e.g., McNemar or paired bootstrap) for the difference; without these, the gap cannot be distinguished from dataset partitioning effects or random variation.
  2. [Abstract] Abstract / Experimental Setup: No description is given of text preprocessing steps tailored to Indonesian (e.g., handling of affixes, reduplication, slang, or code-mixing with English), nor of tokenization, embedding initialization, or hyperparameter search ranges for either PyCaret models or the BiLSTM; these omissions make it impossible to attribute the performance difference to architecture rather than data handling choices.
  3. [Results] Results: The manuscript reports only aggregate accuracy and F1 without error analysis, confusion matrices, or qualitative review of misclassified examples; this prevents verification that BiLSTM's advantage arises from better sequential modeling rather than from exploiting trivial patterns or label noise in the Hugging Face dataset.
minor comments (2)
  1. [Abstract] The abstract states the dataset 'is partitioned' but never specifies the exact proportions or whether the split was stratified by label or review length.
  2. [Abstract] Training-time comparisons are mentioned for LightGBM but not quantified for BiLSTM or the other ML baselines, weakening the efficiency claim.
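The significance test named in major comment 1 can be sketched with the standard library alone. The version below is the exact (binomial) form of McNemar's test on paired correctness flags. The test-set size of 1,500 and the error overlap are assumptions: at that size a 0.64-point gap is roughly 10 reviews, and the paper reports neither the split nor which reviews each model got wrong.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness flags."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b  # discordant pairs; concordant pairs carry no information
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

# Hypothetical flags: 17 BiLSTM errors, 27 LightGBM errors, 12 of them shared.
n_test = 1500
bilstm_wrong = set(range(0, 17))
lgbm_wrong = set(range(5, 32))
correct_bilstm = [i not in bilstm_wrong for i in range(n_test)]
correct_lgbm = [i not in lgbm_wrong for i in range(n_test)]

only_bilstm, only_lgbm, p_value = mcnemar_exact(correct_bilstm, correct_lgbm)
print(only_bilstm, only_lgbm, round(p_value, 4))
```

The p-value depends entirely on the discordant counts, which are invented here. With the same two accuracies but a different error overlap, the conclusion could change, which is why per-example predictions are needed to settle the point.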

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve reproducibility and interpretability of the results.

  1. Referee: [Abstract] Abstract: The superiority claim for BiLSTM rests on a 0.64-point accuracy gap over LightGBM, yet the text supplies no train/validation/test split ratios, no class distribution or stratification details, and no statistical significance test (e.g., McNemar or paired bootstrap) for the difference; without these, the gap cannot be distinguished from dataset partitioning effects or random variation.

    Authors: We agree that these details are necessary to validate the reported gap. In the revised manuscript, we will explicitly state the train/validation/test split ratios and the class distribution with stratification details, and we will include McNemar's test to assess the statistical significance of the 0.64-point accuracy difference between BiLSTM and LightGBM. revision: yes

  2. Referee: [Abstract] Abstract / Experimental Setup: No description is given of text preprocessing steps tailored to Indonesian (e.g., handling of affixes, reduplication, slang, or code-mixing with English), nor of tokenization, embedding initialization, or hyperparameter search ranges for either PyCaret models or the BiLSTM; these omissions make it impossible to attribute the performance difference to architecture rather than data handling choices.

    Authors: We acknowledge this omission limits attribution of results to model architecture. We will expand the Experimental Setup section to detail Indonesian-specific preprocessing steps (including affixes, reduplication, slang, and code-mixing), tokenization method, embedding initialization, and hyperparameter settings or search ranges used for both PyCaret models and BiLSTM. revision: yes

  3. Referee: [Results] Results: The manuscript reports only aggregate accuracy and F1 without error analysis, confusion matrices, or qualitative review of misclassified examples; this prevents verification that BiLSTM's advantage arises from better sequential modeling rather than from exploiting trivial patterns or label noise in the Hugging Face dataset.

    Authors: We agree that aggregate metrics alone are insufficient for interpretation. We will add confusion matrices and a qualitative error analysis of misclassified examples to the Results section, focusing on instances where BiLSTM's sequential context capture provides an edge, to address potential data artifacts. revision: yes
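The kind of error analysis promised here could start from something as small as the sketch below: a confusion count, per-class precision, recall, and F1, and a list of misclassified reviews for manual reading. The labels, predictions, and review texts are hypothetical and serve only to show the shape of the output.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one label, guarding against empty denominators."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def misclassified(texts, y_true, y_pred):
    """Return (text, true, predicted) triples for qualitative review."""
    return [(x, t, p) for x, t, p in zip(texts, y_true, y_pred) if t != p]

# Hypothetical examples, invented for illustration.
texts = ["bagus sekali", "tidak bagus", "cepat sampai", "rusak parah"]
y_true = ["pos", "neg", "pos", "neg"]
y_pred = ["pos", "pos", "pos", "neg"]

matrix = confusion_counts(y_true, y_pred)
neg_precision, neg_recall, neg_f1 = per_class_scores(y_true, y_pred, "neg")
errors = misclassified(texts, y_true, y_pred)
print(dict(matrix), neg_precision, neg_recall, round(neg_f1, 3), errors)
```

Per-class scores matter because a single aggregate F1 can hide a weak minority class, and reading the misclassified texts is the only way to tell whether the errors involve negation and word order (where sequence modeling should help) or simply noisy labels.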

Circularity Check

0 steps flagged

No circularity: direct empirical benchmarking on held-out data

The paper reports accuracy and F1 scores from standard model training and evaluation on a partitioned dataset (train/validation/test splits of the 15k Hugging Face samples). No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear. The BiLSTM and LightGBM results are measured outcomes, not quantities that follow by construction from the inputs. No self-citations are used to justify uniqueness or to import ansatzes. The central claim reduces to an experimental measurement on held-out data, so it does not depend on its own conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions (i.i.d. splits, appropriate feature extraction for text) with no free parameters, axioms, or invented entities explicitly introduced beyond the choice of model architectures.

