arxiv: 2605.03443 · v1 · submitted 2026-05-05 · 💻 cs.CL

Sentiment Analysis of Indonesian Spotify Reviews Using Machine Learning and BiLSTM

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords: sentiment analysis, Indonesian language, Spotify reviews, BiLSTM, machine learning, text classification, SMOTE, natural language processing

The pith

BiLSTM achieves the highest weighted F1-score for three-class sentiment classification of Indonesian Spotify reviews but underperforms on the neutral class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks Support Vector Machine, Multinomial Naive Bayes, Decision Tree, and a two-layer BiLSTM on 70,155 cleaned Indonesian Spotify reviews for positive, negative, and neutral labels. All models share the same pipeline of slang normalization, stopword removal, and stemming, with SMOTE applied to address imbalance in the machine learning runs. BiLSTM leads on the aggregate weighted F1 metric, yet Decision Tree with SMOTE delivers more even accuracy across the three classes. A sympathetic reader would care because the comparison reveals concrete trade-offs when deploying sequence models versus classical classifiers on imbalanced, non-English social text from a consumer platform.
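
The shared preprocessing pipeline described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the slang map, the stopword set, and the suffix-stripping `toy_stem` function are hypothetical stand-ins, since this summary does not give the paper's actual dictionaries or stemmer.

```python
import re

# Hypothetical lookup tables for illustration only; the paper's actual
# slang dictionary and stopword list are not given in this summary.
SLANG = {"gk": "tidak", "ga": "tidak", "bgt": "banget", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "ini"}
SUFFIXES = ("nya", "kan")  # toy stand-in for a real Indonesian stemmer


def toy_stem(token):
    """Strip one known suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, normalize slang, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [SLANG.get(t, t) for t in tokens]          # slang normalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [toy_stem(t) for t in tokens]                # stemming


print(preprocess("Aplikasinya bagus bgt, yg ini gk error"))
# ['aplikasi', 'bagus', 'banget', 'tidak', 'error']
```

The negation word "tidak" is deliberately left out of the hypothetical stopword set, because removing negations would flip the apparent sentiment of a review.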

Core claim

The paper establishes that BiLSTM attains the highest weighted F1-score overall in three-class sentiment classification of Indonesian Spotify reviews, outperforming the classical models, but fails on the minority neutral class, whereas machine learning approaches that incorporate SMOTE achieve more balanced performance across positive, negative, and neutral categories.
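
To see how a model can lead on weighted F1 while failing a minority class, the sketch below computes per-class, weighted, and macro F1 in plain Python. The label counts are invented for illustration and are not results from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


def macro_f1(y_true, y_pred):
    """Average per-class F1, giving every class equal weight."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented example: 50 positive, 40 negative, 10 neutral reviews.
# The model gets positive and negative right but labels every neutral as positive.
y_true = ["pos"] * 50 + ["neg"] * 40 + ["neu"] * 10
y_pred = ["pos"] * 50 + ["neg"] * 40 + ["pos"] * 10

print(per_class_f1(y_true, y_pred)["neu"])    # 0.0
print(round(weighted_f1(y_true, y_pred), 3))  # 0.855
print(round(macro_f1(y_true, y_pred), 3))     # 0.636
```

In this invented case the neutral class scores zero, yet weighted F1 stays near 0.85 because neutral reviews carry only 10% of the weight. Macro F1 exposes the failure.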

What carries the argument

A two-layer BiLSTM that reads the preprocessed Indonesian text in both directions, capturing forward and backward context for sentiment prediction.

If this is right

  • BiLSTM is the stronger choice when the priority is aggregate sentiment detection accuracy rather than per-class equity.
  • Machine learning models paired with SMOTE are preferable when balanced detection across positive, negative, and neutral is required.
  • Decision Tree is the strongest performer among the classical models under the shared preprocessing pipeline.
  • The preprocessing steps of slang normalization, stopword removal, and stemming effectively handle Indonesian-specific language features in review text.
  • Class imbalance in scraped review data causes models to favor majority sentiments unless countered by balancing techniques like SMOTE.
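
As a rough illustration of the balancing idea behind SMOTE, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_like`, its parameters, and the two-dimensional toy points are assumptions for illustration. This summary does not state which feature representation or SMOTE settings the paper used.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by squared distance to base.
        others = minority[:i] + minority[i + 1:]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with three 2-D feature vectors (hypothetical values).
neutral_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
for point in smote_like(neutral_points, n_new=3):
    print(point)
```

Oversampling of this kind is normally applied to the training portion only, so that synthetic points never leak into evaluation.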

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid pipelines that feed SMOTE-balanced data into a BiLSTM could close the gap on the neutral class while retaining overall performance gains.
  • The pattern may generalize to other low-resource languages or review platforms, suggesting similar benchmarks could guide model selection elsewhere.
  • Customer analytics tools for music apps might route high-confidence positive or negative cases through BiLSTM and flag neutral cases for the more balanced ML path.
  • Larger or more deliberately sampled datasets would be needed to test whether BiLSTM's neutral-class weakness is inherent or simply a product of the current collection method.

Load-bearing premise

The scraped and cleaned dataset of 70,155 samples is representative of real-world Indonesian Spotify review sentiments without major bias introduced by the scraping process, cleaning rules, or class imbalance handling.

What would settle it

Rerun the identical models and preprocessing on a fresh collection of Indonesian Spotify reviews gathered through a different channel, such as an official API sample, and check whether the weighted-F1 ranking reverses or neutral-class performance equalizes.

Figures

Figures reproduced from arXiv: 2605.03443 by Andre Hadiman Rotua Parhusip, Ardika Satria, Luluk Muthoharoh, Martin C. T. Manullang, Sahid Maulana, Uliano Wilyam Purba.

Figure 1: Sentiment distribution rating label from raw data.
Figure 2: Training and Validation Loss/Accuracy Curves during the optimization phase.
Figure 3: Performance comparison of the three Scikit-learn classifiers across four evaluation metrics.
Figure 4: Per-class classification report of the Decision Tree classifier on the held-out test fold.
Figure 5: Confusion matrix of the Decision Tree classifier on the test fold (balanced via SMOTE).
Figure 6: Confusion matrix of the BiLSTM model on the held-out test set (
Figure 7: BiLSTM training and validation loss (left) and accuracy (right) across six epochs.
Original abstract

This paper benchmarks classical machine learning and deep learning approaches for three-class sentiment classification of Indonesian Spotify reviews. Using 100,000 scraped reviews and 70,155 cleaned samples, the study compares Support Vector Machine, Multinomial Naive Bayes, and Decision Tree models with a two-layer BiLSTM. Both approaches use the same preprocessing pipeline, including slang normalization, stopword removal, and stemming. Decision Tree achieves the best performance among the classical models, while BiLSTM attains the highest weighted F1-score overall but fails on the minority neutral class. The paper concludes that BiLSTM is stronger for overall sentiment detection, whereas machine learning with SMOTE provides more balanced three-class performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks classical machine learning models (SVM, Multinomial Naive Bayes, Decision Tree) against a two-layer BiLSTM for three-class sentiment classification of Indonesian Spotify reviews. It starts from 100,000 scraped reviews, of which 70,155 remain after cleaning, and applies slang normalization, stopword removal, and stemming. Decision Tree performs best among the classical models, while BiLSTM achieves the highest weighted F1-score overall but underperforms on the minority neutral class. The conclusion recommends BiLSTM for overall sentiment detection and ML with SMOTE for more balanced three-class results.

Significance. If the experimental details are supplied and dataset biases addressed, the work supplies a new domain-specific Indonesian dataset and a practical comparison of aggregate versus per-class performance trade-offs. This can inform model choice for similar low-resource sentiment tasks, particularly where class imbalance is present.

major comments (3)
  1. §3 (Data Collection and Preprocessing): The scraping (100k initial) and cleaning pipeline is described but not validated against external distributions; no analysis shows whether neutral reviews (already minority) were disproportionately affected by slang normalization or stemming, which directly threatens the generalizability of the BiLSTM vs. ML+SMOTE comparison.
  2. §4.1 (Experimental Setup): No information is given on train-test split ratio or stratification, hyperparameter search procedure for any model (including BiLSTM layer sizes, embedding, epochs), or whether SMOTE was applied to the BiLSTM training regime, rendering the reported performance differences non-reproducible and the fairness of the comparison unclear.
  3. §5 (Results): Claims that BiLSTM attains the highest weighted F1 while ML+SMOTE is more balanced lack statistical significance tests (e.g., McNemar or bootstrap) and ablations isolating the effect of SMOTE; without these, the central comparative conclusion rests on point estimates whose reliability cannot be assessed.
minor comments (2)
  1. Abstract: The phrase 'machine learning with SMOTE' is ambiguous; it should specify which of the three classical models received SMOTE and whether the BiLSTM used any imbalance mitigation.
  2. §2 (Related Work): Recent transformer-based Indonesian sentiment models are not cited, which would strengthen the positioning of the BiLSTM baseline.
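
Major comment 2 asks whether the train-test split was stratified. A minimal sketch of what stratification means is given below, in plain Python. The 20% test fraction, the seed, and the class counts are assumptions for illustration; the summary does not report the paper's actual split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) so that each class contributes
    roughly the same fraction of its samples to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Invented imbalanced label list: 50 positive, 40 negative, 10 neutral.
labels = ["pos"] * 50 + ["neg"] * 40 + ["neu"] * 10
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))              # 80 20
print(sum(labels[i] == "neu" for i in test_idx))  # 2
```

Without stratification, a random split of a small minority class can leave the test set with too few neutral reviews for per-class metrics to be meaningful.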

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the reproducibility and robustness of our analysis. We address each major comment below and indicate the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: §3 (Data Collection and Preprocessing): The scraping (100k initial) and cleaning pipeline is described but not validated against external distributions; no analysis shows whether neutral reviews (already minority) were disproportionately affected by slang normalization or stemming, which directly threatens the generalizability of the BiLSTM vs. ML+SMOTE comparison.

    Authors: We agree that an explicit check on the impact of preprocessing on class balance would strengthen the work. In the revision we will add a table reporting the number of samples per class (positive, negative, neutral) at each stage of the pipeline (raw, after slang normalization, after stopword removal, after stemming). This will directly show whether the minority neutral class was disproportionately reduced. For validation against external distributions, no comparable public Indonesian Spotify review corpus exists; we will therefore add a limitations paragraph noting this constraint and the resulting potential for domain-specific bias. revision: partial

  2. Referee: §4.1 (Experimental Setup): No information is given on train-test split ratio or stratification, hyperparameter search procedure for any model (including BiLSTM layer sizes, embedding, epochs), or whether SMOTE was applied to the BiLSTM training regime, rendering the reported performance differences non-reproducible and the fairness of the comparison unclear.

    Authors: We acknowledge that these details are essential for reproducibility. The revised manuscript will state the train-test split ratio employed, whether the split was stratified by sentiment label, the hyperparameter search procedure used for the classical models and for the BiLSTM (including how layer sizes, embedding dimension, and training epochs were selected), and whether SMOTE was applied to the BiLSTM training data. We will also list the final hyperparameter values chosen for each model. revision: yes

  3. Referee: §5 (Results): Claims that BiLSTM attains the highest weighted F1 while ML+SMOTE is more balanced lack statistical significance tests (e.g., McNemar or bootstrap) and ablations isolating the effect of SMOTE; without these, the central comparative conclusion rests on point estimates whose reliability cannot be assessed.

    Authors: We accept that statistical testing and ablation studies are required to support the comparative claims. In the revision we will add McNemar’s test (or bootstrap confidence intervals) for the key pairwise comparisons and report the resulting p-values. We will also include an ablation table showing the performance of the classical models both with and without SMOTE, thereby isolating the contribution of oversampling to the observed per-class balance. revision: yes
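
The McNemar comparison mentioned in the third response can be sketched as an exact two-sided test on the cases where the two models disagree. The function name `mcnemar_exact` and the toy predictions are assumptions for illustration; they are not the authors' code or data.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers on the same test set.

    Returns (b, c, p_value), where b counts samples only model A gets right
    and c counts samples only model B gets right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(a == t and p != t for t, a, p in triples)
    c = sum(a != t and p == t for t, a, p in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: model A is right on all ten samples, model B on only two.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2

print(mcnemar_exact(y_true, pred_a, pred_b))  # (8, 0, 0.0078125)
```

The test looks only at discordant pairs, so it applies directly when both models are scored on the same held-out reviews, which is the setting the referee describes.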

Circularity Check

0 steps flagged

No circularity: standard empirical ML benchmarking on scraped corpus

The paper reports results from training and evaluating off-the-shelf classifiers (SVM, Multinomial Naive Bayes, Decision Tree, BiLSTM) on a fixed scraped-and-cleaned dataset of Indonesian Spotify reviews. All steps are standard preprocessing (slang normalization, stopword removal, stemming) followed by direct performance measurement via weighted F1 and per-class metrics. No equations, derivations, or parameter-fitting procedures are present that could reduce a claimed prediction to its own inputs by construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The central comparison (BiLSTM highest weighted F1 but weak on neutral class; ML+SMOTE more balanced) is a straightforward empirical outcome on the given corpus and does not rely on any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard supervised learning assumptions and routine preprocessing choices with no new free parameters, axioms, or invented entities introduced by the paper.

axioms (1)
  • Domain assumption: Scraped online reviews form an independent and identically distributed sample suitable for supervised classification after standard cleaning.
    Implicit in any sentiment analysis study using web-scraped text data.

