Sentiment Analysis of Indonesian Spotify Reviews Using Machine Learning and BiLSTM

Andre Hadiman Rotua Parhusip; Ardika Satria; Luluk Muthoharoh; Martin C. T. Manullang; Sahid Maulana; Uliano Wilyam Purba

arxiv: 2605.03443 · v1 · submitted 2026-05-05 · 💻 cs.CL

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords: sentiment analysis · Indonesian language · Spotify reviews · BiLSTM · machine learning · text classification · SMOTE · natural language processing

The pith

BiLSTM achieves the highest weighted F1-score for three-class sentiment classification of Indonesian Spotify reviews but underperforms on the neutral class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks Support Vector Machine, Multinomial Naive Bayes, Decision Tree, and a two-layer BiLSTM on 70,155 cleaned Indonesian Spotify reviews for positive, negative, and neutral labels. All models share the same pipeline of slang normalization, stopword removal, and stemming, with SMOTE applied to address imbalance in the machine learning runs. BiLSTM leads on the aggregate weighted F1 metric, yet Decision Tree with SMOTE delivers more even accuracy across the three classes. A sympathetic reader would care because the comparison reveals concrete trade-offs when deploying sequence models versus classical classifiers on imbalanced, non-English social text from a consumer platform.
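The shared preprocessing pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `SLANG`, `STOPWORDS`, and `SUFFIXES` tables are hypothetical stand-ins, and `toy_stem` is a crude suffix stripper rather than a real Indonesian stemmer.

```python
import re

# Hypothetical lookup tables for illustration only; the paper's actual
# slang dictionary, stopword list, and stemmer are not given on this page.
SLANG = {"gk": "tidak", "bgt": "banget", "apk": "aplikasi", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}
SUFFIXES = ("nya", "kan", "lah")  # toy suffix list, not a real stemmer


def normalize_slang(tokens):
    # Map informal spellings to their standard form where known.
    return [SLANG.get(t, t) for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    # Strip one known suffix if at least four characters remain.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = normalize_slang(tokens)
    tokens = remove_stopwords(tokens)
    return [toy_stem(t) for t in tokens]


print(preprocess("Apk ini bagus bgt, lagunya lengkap!"))
```

Slang normalization runs before stopword removal in this sketch so that a slang form such as `yg` is first mapped to `yang` and then dropped as a stopword.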

Core claim

The paper establishes that BiLSTM attains the highest weighted F1-score overall in three-class sentiment classification of Indonesian Spotify reviews, outperforming the classical models, but fails on the minority neutral class, whereas machine learning approaches that incorporate SMOTE achieve more balanced performance across positive, negative, and neutral categories.
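A small worked example may help show how a model can lead on weighted F1 while failing a minority class. The confusion matrix below is hypothetical and is not taken from the paper; it only illustrates that weighted F1 scales each class by its support, whereas macro F1 treats the classes equally.

```python
def per_class_f1(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(len(labels))) - tp
        fn = sum(confusion[k]) - tp
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_and_macro_f1(confusion, labels):
    f1 = per_class_f1(confusion, labels)
    support = {label: sum(confusion[k]) for k, label in enumerate(labels)}
    total = sum(support.values())
    weighted = sum(f1[l] * support[l] for l in labels) / total
    macro = sum(f1.values()) / len(labels)
    return f1, weighted, macro


labels = ["negative", "neutral", "positive"]
# Hypothetical counts (NOT the paper's results): neutral is never predicted.
confusion = [
    [350, 0, 50],   # true negative
    [50, 0, 50],    # true neutral
    [30, 0, 470],   # true positive
]
f1, weighted, macro = weighted_and_macro_f1(confusion, labels)
print(f1)
print(round(weighted, 3), round(macro, 3))
```

With these illustrative counts the neutral F1 is zero, yet the weighted F1 stays well above the macro F1 because neutral contributes only 10% of the support.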

What carries the argument

The two-layer BiLSTM that processes the preprocessed Indonesian text bidirectionally to capture forward and backward context for sentiment prediction.
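For reference, the standard bidirectional recurrence for a stack of two layers can be written as below. The layer index $\ell \in \{1, 2\}$ and the choice of summary vector $z$ (the concatenated final states of the second layer) are assumptions for illustration; the page does not state how the paper pools the sequence before classification.

```latex
\begin{align}
\overrightarrow{h}_t^{(\ell)} &= \mathrm{LSTM}_{\rightarrow}^{(\ell)}\big(x_t^{(\ell)},\ \overrightarrow{h}_{t-1}^{(\ell)}\big) \\
\overleftarrow{h}_t^{(\ell)} &= \mathrm{LSTM}_{\leftarrow}^{(\ell)}\big(x_t^{(\ell)},\ \overleftarrow{h}_{t+1}^{(\ell)}\big) \\
h_t^{(\ell)} &= \big[\overrightarrow{h}_t^{(\ell)} ;\ \overleftarrow{h}_t^{(\ell)}\big], \qquad x_t^{(2)} = h_t^{(1)} \\
\hat{y} &= \mathrm{softmax}(W z + b), \qquad z = \big[\overrightarrow{h}_T^{(2)} ;\ \overleftarrow{h}_1^{(2)}\big]
\end{align}
```

Here $x_t^{(1)}$ is the embedding of the $t$-th token, $T$ is the sequence length, and $\hat{y}$ is a distribution over the three sentiment classes.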

If this is right

BiLSTM is the stronger choice when the priority is aggregate sentiment detection accuracy rather than per-class equity.
Machine learning models paired with SMOTE are preferable when balanced detection across positive, negative, and neutral is required.
Decision Tree is the strongest performer among the classical models under the shared preprocessing pipeline.
The preprocessing steps of slang normalization, stopword removal, and stemming effectively handle Indonesian-specific language features in review text.
Class imbalance in scraped review data causes models to favor majority sentiments unless countered by balancing techniques like SMOTE.
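The oversampling idea behind SMOTE, referenced in the points above, can be sketched in a few lines: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, the two-dimensional feature vectors, and the target count are hypothetical; the page does not specify the paper's feature representation or SMOTE settings.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D vectors standing in for feature rows of neutral reviews.
neutral_train = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7], [0.3, 0.85]]
majority_count = 10  # assumed size of the largest class in the training fold
new_points = smote_sketch(neutral_train, majority_count - len(neutral_train))
balanced_neutral = neutral_train + new_points
print(len(balanced_neutral))
```

Oversampling of this kind is normally applied to the training fold only, so that the test fold keeps the original class distribution.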

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid pipelines that feed SMOTE-balanced data into a BiLSTM could close the gap on the neutral class while retaining overall performance gains.
The pattern may generalize to other low-resource languages or review platforms, suggesting similar benchmarks could guide model selection elsewhere.
Customer analytics tools for music apps might route high-confidence positive or negative cases through BiLSTM and flag neutral cases for the more balanced ML path.
Larger or more deliberately sampled datasets would be needed to test whether BiLSTM's neutral-class weakness is inherent or simply a product of the current collection method.

Load-bearing premise

The scraped and cleaned dataset of 70,155 samples is representative of real-world Indonesian Spotify review sentiments without major bias introduced by the scraping process, cleaning rules, or class imbalance handling.

What would settle it

Running the identical models and preprocessing on a fresh collection of Indonesian Spotify reviews gathered through a different channel, such as an official API sample, that reverses the ranking on weighted F1 or equalizes neutral-class performance.

Figures

Figures reproduced from arXiv: 2605.03443 by Andre Hadiman Rotua Parhusip, Ardika Satria, Luluk Muthoharoh, Martin C. T. Manullang, Sahid Maulana, Uliano Wilyam Purba.

**Figure 1.** Sentiment distribution rating label from raw data.

**Figure 2.** Training and Validation Loss/Accuracy Curves during the optimization phase.

**Figure 3.** Performance comparison of the three Scikit-learn classifiers across four evaluation metrics.

**Figure 4.** Per-class classification report of the Decision Tree classifier on the held-out test fold.

**Figure 5.** Confusion matrix of the Decision Tree classifier on the test fold (balanced via SMOTE).

**Figure 6.** Confusion matrix of the BiLSTM model on the held-out test set (…

**Figure 7.** BiLSTM training and validation loss (left) and accuracy (right) across six epochs.

Original abstract

This paper benchmarks classical machine learning and deep learning approaches for three-class sentiment classification of Indonesian Spotify reviews. Using 100,000 scraped reviews and 70,155 cleaned samples, the study compares Support Vector Machine, Multinomial Naive Bayes, and Decision Tree models with a two-layer BiLSTM. Both approaches use the same preprocessing pipeline, including slang normalization, stopword removal, and stemming. Decision Tree achieves the best performance among the classical models, while BiLSTM attains the highest weighted F1-score overall but fails on the minority neutral class. The paper concludes that BiLSTM is stronger for overall sentiment detection, whereas machine learning with SMOTE provides more balanced three-class performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A straightforward benchmark applying standard models to a new Indonesian Spotify review dataset, where BiLSTM leads on weighted F1 but lags on the neutral class while Decision Tree with SMOTE balances better.

This paper benchmarks SVM, Multinomial Naive Bayes, Decision Tree, and a two-layer BiLSTM on three-class sentiment classification, using 70,155 cleaned Indonesian Spotify reviews drawn from an initial scrape of 100,000. The core finding is that BiLSTM achieves the highest weighted F1 overall but performs poorly on the minority neutral class, whereas Decision Tree with SMOTE delivers more balanced results across positive, negative, and neutral labels. The preprocessing pipeline of slang normalization, stopword removal, and stemming is applied uniformly, which keeps the comparison clean on that front.

What is actually new here is the corpus itself, which supplies concrete numbers for Indonesian, a language with limited prior sentiment benchmarks in this domain. The paper does well at surfacing the practical trade-off between overall performance and class balance, and it avoids overclaiming by noting the neutral-class weakness explicitly. The conclusions follow directly from the reported outcomes without circular reasoning or invented entities.

The soft spots are real but not fatal. Details on train-test splits, hyperparameter search, exact metric definitions, and statistical significance testing are missing, which makes reproducibility harder. The scraping and cleaning process could introduce selection bias, particularly if neutral reviews were filtered more aggressively by the language rules. Applying SMOTE only to the classical models also creates an asymmetric setup, and the paper does not show that this asymmetry leaves the BiLSTM comparison unaffected. These points limit how far the comparative claims can be generalized beyond this specific corpus.

This work is for practitioners or researchers doing applied sentiment analysis on non-English product reviews or similar low-resource settings. A reader needing baseline numbers for Indonesian text would get modest practical value from the results. The paper shows clear, honest engagement with the task and literature, so it deserves a serious referee. I would recommend sending it for peer review, with the expectation that the authors supply the missing experimental details and address the dataset representativeness.

Referee Report

3 major / 2 minor

Summary. The paper benchmarks classical machine learning models (SVM, Multinomial Naive Bayes, Decision Tree) against a two-layer BiLSTM for three-class sentiment classification of Indonesian Spotify reviews. Starting from 100,000 scraped reviews cleaned to 70,155 samples via slang normalization, stopword removal, and stemming, it reports that Decision Tree performs best among the classical models while BiLSTM achieves the highest weighted F1-score overall but underperforms on the minority neutral class; the conclusion recommends BiLSTM for overall sentiment detection and ML with SMOTE for more balanced three-class results.

Significance. If the experimental details are supplied and dataset biases addressed, the work supplies a new domain-specific Indonesian dataset and a practical comparison of aggregate versus per-class performance trade-offs. This can inform model choice for similar low-resource sentiment tasks, particularly where class imbalance is present.

major comments (3)

[§3] §3 (Data Collection and Preprocessing): The scraping (100k initial) and cleaning pipeline is described but not validated against external distributions; no analysis shows whether neutral reviews (already minority) were disproportionately affected by slang normalization or stemming, which directly threatens the generalizability of the BiLSTM vs. ML+SMOTE comparison.
[§4.1] §4.1 (Experimental Setup): No information is given on train-test split ratio or stratification, hyperparameter search procedure for any model (including BiLSTM layer sizes, embedding, epochs), or whether SMOTE was applied to the BiLSTM training regime, rendering the reported performance differences non-reproducible and the fairness of the comparison unclear.
[§5] §5 (Results): Claims that BiLSTM attains the highest weighted F1 while ML+SMOTE is more balanced lack statistical significance tests (e.g., McNemar or bootstrap) and ablations isolating the effect of SMOTE; without these, the central comparative conclusion rests on point estimates whose reliability cannot be assessed.
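The stratification question raised in the §4.1 comment can be made concrete with a short sketch. The 80/20 ratio, the seed, and the class counts below are assumptions for illustration; the page states that the paper does not report its split ratio or whether it stratified.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label distribution; not the paper's class counts.
labels = ["positive"] * 60 + ["negative"] * 30 + ["neutral"] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a random split of an imbalanced corpus can leave the test fold with very few neutral reviews, which makes the per-class metrics for that class unstable.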

minor comments (2)

[Abstract] Abstract: The phrase 'machine learning with SMOTE' is ambiguous; it should specify which of the three classical models received SMOTE and whether the BiLSTM used any imbalance mitigation.
[§2] §2 (Related Work): Recent transformer-based Indonesian sentiment models are not cited, which would strengthen the positioning of the BiLSTM baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the reproducibility and robustness of our analysis. We address each major comment below and indicate the changes planned for the revised manuscript.

Referee: [§3] §3 (Data Collection and Preprocessing): The scraping (100k initial) and cleaning pipeline is described but not validated against external distributions; no analysis shows whether neutral reviews (already minority) were disproportionately affected by slang normalization or stemming, which directly threatens the generalizability of the BiLSTM vs. ML+SMOTE comparison.

Authors: We agree that an explicit check on the impact of preprocessing on class balance would strengthen the work. In the revision we will add a table reporting the number of samples per class (positive, negative, neutral) at each stage of the pipeline (raw, after slang normalization, after stopword removal, after stemming). This will directly show whether the minority neutral class was disproportionately reduced. For validation against external distributions, no comparable public Indonesian Spotify review corpus exists; we will therefore add a limitations paragraph noting this constraint and the resulting potential for domain-specific bias. revision: partial

Referee: [§4.1] §4.1 (Experimental Setup): No information is given on train-test split ratio or stratification, hyperparameter search procedure for any model (including BiLSTM layer sizes, embedding, epochs), or whether SMOTE was applied to the BiLSTM training regime, rendering the reported performance differences non-reproducible and the fairness of the comparison unclear.

Authors: We acknowledge that these details are essential for reproducibility. The revised manuscript will state the train-test split ratio employed, whether the split was stratified by sentiment label, the hyperparameter search procedure used for the classical models and for the BiLSTM (including how layer sizes, embedding dimension, and training epochs were selected), and whether SMOTE was applied to the BiLSTM training data. We will also list the final hyperparameter values chosen for each model. revision: yes

Referee: [§5] §5 (Results): Claims that BiLSTM attains the highest weighted F1 while ML+SMOTE is more balanced lack statistical significance tests (e.g., McNemar or bootstrap) and ablations isolating the effect of SMOTE; without these, the central comparative conclusion rests on point estimates whose reliability cannot be assessed.

Authors: We accept that statistical testing and ablation studies are required to support the comparative claims. In the revision we will add McNemar’s test (or bootstrap confidence intervals) for the key pairwise comparisons and report the resulting p-values. We will also include an ablation table showing the performance of the classical models both with and without SMOTE, thereby isolating the contribution of oversampling to the observed per-class balance. revision: yes
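As a rough sketch of the kind of test mentioned in this response, the exact two-sided McNemar test compares two classifiers on the same test samples using only the cases where they disagree. The predictions below are hypothetical and the function name `mcnemar_exact` is illustrative; none of it comes from the paper.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    b = samples model A gets right and model B gets wrong;
    c = samples model B gets right and model A gets wrong.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, disagreements follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical paired predictions on 12 test reviews (not the paper's data).
y_true = ["pos"] * 12
pred_a = ["pos"] * 11 + ["neg"]
pred_b = ["pos"] * 3 + ["neg"] * 8 + ["pos"]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Because the test uses paired outcomes on identical samples, it requires both models to be evaluated on the same held-out fold.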

Circularity Check

0 steps flagged

No circularity: standard empirical ML benchmarking on scraped corpus

The paper reports results from training and evaluating off-the-shelf classifiers (SVM, Multinomial Naive Bayes, Decision Tree, BiLSTM) on a fixed scraped-and-cleaned dataset of Indonesian Spotify reviews. All steps are standard preprocessing (slang normalization, stopword removal, stemming) followed by direct performance measurement via weighted F1 and per-class metrics. No equations, derivations, or parameter-fitting procedures are present that could reduce a claimed prediction to its own inputs by construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The central comparison (BiLSTM highest weighted F1 but weak on neutral class; ML+SMOTE more balanced) is a straightforward empirical outcome on the given corpus and does not rely on any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions and routine preprocessing choices with no new free parameters, axioms, or invented entities introduced by the paper.

axioms (1)

domain assumption Scraped online reviews form an independent and identically distributed sample suitable for supervised classification after standard cleaning
Implicit in any sentiment analysis study using web-scraped text data.

pith-pipeline@v0.9.0 · 5435 in / 1303 out tokens · 50223 ms · 2026-05-07T16:45:17.810434+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 10 canonical work pages

[1] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008. doi:10.1561/1500000011

[2] Ananya Pratap and Nidhi. Comparative sentiment analysis of app reviews. arXiv preprint arXiv:2006.09739, 2020. URL https://arxiv.org/abs/2006.09739

[3] Furqan Rustam, Imran Ashraf, Arif Mehmood, Saleem Ullah, et al. A comparative study of sentiment analysis using NLP and different machine learning techniques on US airline Twitter data. arXiv preprint arXiv:2110.00859, 2021. URL https://arxiv.org/abs/2110.00859

[4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi:10.1162/neco.1997.9.8.1735

[5] Guixian Xu, Yuxin Meng, Xuxiang Qiu, Zhenlong Yu, and Xueliang Wu. Sentiment analysis of comment texts based on BiLSTM. IEEE Access, 7:51522–51532, 2019. doi:10.1109/ACCESS.2019.2911964

[6] Chih-Hsueh Lin and Ulin Nuha. Sentiment analysis of Indonesian datasets based on a hybrid deep-learning strategy. Journal of Big Data, 10(1):88, 2023. doi:10.1186/s40537-023-00782-9

[7] Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for..., 2020.

[8] Setyo Kuncahyo et al. Improving Bi-LSTM performance for Indonesian sentiment analysis using paragraph vector. arXiv preprint arXiv:2009.05720, 2020. URL https://arxiv.org/abs/2009.05720

[9] Muhammad Farizki et al. Klasifikasi sentimen menggunakan bidirectional LSTM dan IndoBERT dengan dataset terbatas. ZONAsi: Jurnal Sistem Informasi, 2025. URL https://journal.unilak.ac.id/index.php/zn/article/view/25091

[10] Oluwaseun Abiola et al. Sentiment classification on the 2024 Indonesian presidential candidate dataset using deep learning approaches. Indonesian Journal of Statistics and Its Applications, 2024. URL https://journal-stats.ipb.ac.id/index.php/ijsa/article/view/1259

[11] Mirna Adriani, Jelita Asian, Bobby Nazief, Saied M. M. Tahaghoghi, and Hugh E. Williams. Stemming Indonesian. ACM Transactions on Asian Language Information Processing, 6(4):1–33, 2007. doi:10.1145/1316457.1316458

[12] Chukwuemeka Ivan Ossai and Nilmini Wickramasinghe. Sentiment analysis on Google Play Store app users’ reviews based on deep learning approach. Multimedia Tools and Applications, 2024. doi:10.1007/s11042-024-19185-w

[13] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. doi:10.1613/jair.953
