Sentiment Analysis of Indonesian Spotify Reviews Using Machine Learning and BiLSTM
Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3
The pith
BiLSTM achieves the highest weighted F1-score for three-class sentiment classification of Indonesian Spotify reviews but underperforms on the neutral class.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reports that BiLSTM attains the highest weighted F1-score in three-class sentiment classification of Indonesian Spotify reviews, outperforming the classical models, but fails on the minority neutral class. Classical machine learning approaches that incorporate SMOTE, by contrast, perform more evenly across the positive, negative, and neutral categories.
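The tension in this claim is easier to see with numbers. The sketch below uses made-up labels (not the paper's data) to show how a model that never predicts the neutral class can still post a high weighted F1, while the macro average exposes the failure. The class counts and the helper names `per_class_f1` and `weighted_and_macro_f1` are illustrative assumptions.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_and_macro_f1(y_true, y_pred, labels):
    """Weighted F1 averages by class support; macro F1 treats classes equally."""
    f1 = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    total = len(y_true)
    weighted = sum(f1[label] * support[label] / total for label in labels)
    macro = sum(f1.values()) / len(labels)
    return f1, weighted, macro

# Hypothetical, illustrative data: 60 positive, 30 negative, 10 neutral reviews.
labels = ["positive", "negative", "neutral"]
y_true = ["positive"] * 60 + ["negative"] * 30 + ["neutral"] * 10
# A model that never outputs "neutral": the 10 neutral reviews are split
# evenly between the two majority classes.
y_pred = ["positive"] * 60 + ["negative"] * 30 + ["positive"] * 5 + ["negative"] * 5

f1, weighted, macro = weighted_and_macro_f1(y_true, y_pred, labels)
print(f1, round(weighted, 3), round(macro, 3))
```

On this toy input the neutral F1 is 0, yet the weighted F1 is about 0.85 while the macro F1 is about 0.63. Reporting both averages alongside per-class scores makes the trade-off visible.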
What carries the argument
The two-layer BiLSTM that processes the preprocessed Indonesian text bidirectionally to capture forward and backward context for sentiment prediction.
If this is right
- BiLSTM is the stronger choice when the priority is aggregate performance (weighted F1) rather than per-class equity.
- Machine learning models paired with SMOTE are preferable when balanced detection across positive, negative, and neutral is required.
- Decision Tree is the strongest performer among the classical models under the shared preprocessing pipeline.
- The preprocessing steps of slang normalization, stopword removal, and stemming effectively handle Indonesian-specific language features in review text.
- Class imbalance in scraped review data causes models to favor majority sentiments unless countered by balancing techniques like SMOTE.
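The balancing step named in the list above can be sketched as follows. This is a minimal, dependency-free illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), assuming reviews have already been turned into numeric feature vectors. It is not the paper's implementation; the function name `smote_oversample` and the toy vectors are assumptions.

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    minority: list of numeric feature vectors (needs at least 2 points).
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class (e.g. neutral) feature vectors.
neutral_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
extra = smote_oversample(neutral_vectors, n_new=5, k=2, seed=1)
print(extra)
```

Each synthetic point lies on a line segment between two real minority points, so the minority class grows without exact duplication. In practice a maintained library implementation would normally be preferred over a hand-rolled one.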
Where Pith is reading between the lines
- Hybrid pipelines that feed SMOTE-balanced data into a BiLSTM could close the gap on the neutral class while retaining overall performance gains.
- The pattern may generalize to other low-resource languages or review platforms, suggesting similar benchmarks could guide model selection elsewhere.
- Customer analytics tools for music apps might route high-confidence positive or negative cases through BiLSTM and flag neutral cases for the more balanced ML path.
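The routing idea in the list above could look roughly like this. It is a speculative sketch, not something the paper describes: the 0.8 threshold, the function name `route_review`, and the shape of the inputs (a dict of BiLSTM class probabilities and a callable standing in for the SMOTE-trained classical model) are all assumptions.

```python
def route_review(bilstm_probs, balanced_model_predict, features, threshold=0.8):
    """Return (label, source) for one review.

    bilstm_probs: dict mapping class name to BiLSTM probability.
    balanced_model_predict: callable standing in for the SMOTE-trained model.
    features: whatever input the balanced model expects.
    """
    label, confidence = max(bilstm_probs.items(), key=lambda kv: kv[1])
    # Trust the BiLSTM only for confident positive/negative predictions.
    if label in ("positive", "negative") and confidence >= threshold:
        return label, "bilstm"
    # Everything else, including any neutral guess, goes to the balanced path.
    return balanced_model_predict(features), "balanced_ml"

# Usage with a stub in place of a real classical model.
stub_model = lambda features: "neutral"
print(route_review({"positive": 0.92, "negative": 0.05, "neutral": 0.03}, stub_model, None))
print(route_review({"positive": 0.45, "negative": 0.35, "neutral": 0.20}, stub_model, None))
```

The threshold would need tuning on held-out data, since a poorly calibrated BiLSTM could be confidently wrong on reviews that are actually neutral.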
- Larger or more deliberately sampled datasets would be needed to test whether BiLSTM's neutral-class weakness is inherent or simply a product of the current collection method.
Load-bearing premise
The scraped and cleaned dataset of 70,155 samples is representative of real-world Indonesian Spotify review sentiments without major bias introduced by the scraping process, cleaning rules, or class imbalance handling.
What would settle it
Rerunning the identical models and preprocessing on a fresh collection of Indonesian Spotify reviews gathered through a different channel, such as an official API sample. If the weighted-F1 ranking reverses or neutral-class performance equalizes on that data, the claim does not hold.
Original abstract
This paper benchmarks classical machine learning and deep learning approaches for three-class sentiment classification of Indonesian Spotify reviews. Using 100,000 scraped reviews and 70,155 cleaned samples, the study compares Support Vector Machine, Multinomial Naive Bayes, and Decision Tree models with a two-layer BiLSTM. Both approaches use the same preprocessing pipeline, including slang normalization, stopword removal, and stemming. Decision Tree achieves the best performance among the classical models, while BiLSTM attains the highest weighted F1-score overall but fails on the minority neutral class. The paper concludes that BiLSTM is stronger for overall sentiment detection, whereas machine learning with SMOTE provides more balanced three-class performance.
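The shared preprocessing pipeline mentioned in the abstract (slang normalization, stopword removal, stemming) can be sketched as below. Everything specific here is a stand-in: the three-entry slang lexicon, the four-word stopword list, and the `toy_stem` function, which only strips the "-nya" suffix, are illustrative assumptions and not the resources the authors used. A real pipeline would rely on a full lexicon and a proper Indonesian stemmer.

```python
import re

# Toy resources, assumed for illustration only.
SLANG = {"gk": "tidak", "yg": "yang", "bgt": "banget"}
STOPWORDS = {"yang", "dan", "di", "ini"}

def toy_stem(token):
    """Placeholder stemmer: strips only the '-nya' suffix from longer tokens."""
    if token.endswith("nya") and len(token) > 5:
        return token[:-3]
    return token

def preprocess(text):
    """Lowercase, tokenize, normalize slang, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [SLANG.get(t, t) for t in tokens]           # slang normalization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [toy_stem(t) for t in tokens]                 # stemming

print(preprocess("Lagunya bagus bgt, yg ini gk bisa diputar"))
```

Slang normalization runs before stopword removal so that abbreviations such as "yg" are first mapped to their standard form and then filtered. The negation "tidak" is deliberately kept out of the toy stopword list, because dropping negations would change the sentiment of a review.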
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks classical machine learning models (SVM, Multinomial Naive Bayes, Decision Tree) against a two-layer BiLSTM for three-class sentiment classification of Indonesian Spotify reviews. Of 100,000 scraped reviews, 70,155 samples remain after cleaning, and all models share one preprocessing pipeline (slang normalization, stopword removal, stemming). The paper reports that Decision Tree performs best among the classical models, while BiLSTM achieves the highest weighted F1-score overall but fails on the minority neutral class. The conclusion recommends BiLSTM for overall sentiment detection and ML with SMOTE for more balanced three-class results.
Significance. If the experimental details are supplied and dataset biases addressed, the work supplies a new domain-specific Indonesian dataset and a practical comparison of aggregate versus per-class performance trade-offs. This can inform model choice for similar low-resource sentiment tasks, particularly where class imbalance is present.
major comments (3)
- §3 (Data Collection and Preprocessing): The scraping (100k initial) and cleaning pipeline is described but not validated against external distributions; no analysis shows whether neutral reviews (already minority) were disproportionately affected by slang normalization or stemming, which directly threatens the generalizability of the BiLSTM vs. ML+SMOTE comparison.
- §4.1 (Experimental Setup): No information is given on train-test split ratio or stratification, hyperparameter search procedure for any model (including BiLSTM layer sizes, embedding, epochs), or whether SMOTE was applied to the BiLSTM training regime, rendering the reported performance differences non-reproducible and the fairness of the comparison unclear.
- §5 (Results): Claims that BiLSTM attains the highest weighted F1 while ML+SMOTE is more balanced lack statistical significance tests (e.g., McNemar or bootstrap) and ablations isolating the effect of SMOTE; without these, the central comparative conclusion rests on point estimates whose reliability cannot be assessed.
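To make the §4.1 comment concrete, a stratified train-test split can be sketched as follows. The 80/20 ratio, the seed, and the function name `stratified_split` are assumptions chosen for illustration; the paper does not state what it used.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split sample indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced label list: 60 positive, 30 negative, 10 neutral.
labels = ["positive"] * 60 + ["negative"] * 30 + ["neutral"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With stratification the minority class is guaranteed a proportional presence in the test set, which matters when per-class scores are the point of the comparison. Any oversampling such as SMOTE would then be applied to the training indices only, so that synthetic samples never leak into evaluation.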
minor comments (2)
- Abstract: The phrase 'machine learning with SMOTE' is ambiguous; it should specify which of the three classical models received SMOTE and whether the BiLSTM used any imbalance mitigation.
- §2 (Related Work): Recent transformer-based Indonesian sentiment models are not cited, which would strengthen the positioning of the BiLSTM baseline.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving the reproducibility and robustness of our analysis. We address each major comment below and indicate the changes planned for the revised manuscript.
Point-by-point responses
- Referee: §3 (Data Collection and Preprocessing): The scraping (100k initial) and cleaning pipeline is described but not validated against external distributions; no analysis shows whether neutral reviews (already minority) were disproportionately affected by slang normalization or stemming, which directly threatens the generalizability of the BiLSTM vs. ML+SMOTE comparison.
  Authors: We agree that an explicit check on the impact of preprocessing on class balance would strengthen the work. In the revision we will add a table reporting the number of samples per class (positive, negative, neutral) at each stage of the pipeline (raw, after slang normalization, after stopword removal, after stemming). This will directly show whether the minority neutral class was disproportionately reduced. For validation against external distributions, no comparable public Indonesian Spotify review corpus exists; we will therefore add a limitations paragraph noting this constraint and the resulting potential for domain-specific bias.
  Revision: partial
- Referee: §4.1 (Experimental Setup): No information is given on train-test split ratio or stratification, hyperparameter search procedure for any model (including BiLSTM layer sizes, embedding, epochs), or whether SMOTE was applied to the BiLSTM training regime, rendering the reported performance differences non-reproducible and the fairness of the comparison unclear.
  Authors: We acknowledge that these details are essential for reproducibility. The revised manuscript will state the train-test split ratio employed, whether the split was stratified by sentiment label, the hyperparameter search procedure used for the classical models and for the BiLSTM (including how layer sizes, embedding dimension, and training epochs were selected), and whether SMOTE was applied to the BiLSTM training data. We will also list the final hyperparameter values chosen for each model.
  Revision: yes
- Referee: §5 (Results): Claims that BiLSTM attains the highest weighted F1 while ML+SMOTE is more balanced lack statistical significance tests (e.g., McNemar or bootstrap) and ablations isolating the effect of SMOTE; without these, the central comparative conclusion rests on point estimates whose reliability cannot be assessed.
  Authors: We accept that statistical testing and ablation studies are required to support the comparative claims. In the revision we will add McNemar’s test (or bootstrap confidence intervals) for the key pairwise comparisons and report the resulting p-values. We will also include an ablation table showing the performance of the classical models both with and without SMOTE, thereby isolating the contribution of oversampling to the observed per-class balance.
  Revision: yes
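The McNemar comparison promised in the last response can be sketched as follows. This is the exact (binomial) form of the test applied to paired predictions from two models on the same test set. The function name `mcnemar_exact` and the toy predictions are assumptions, and the authors may choose a different variant, such as the chi-square approximation.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    b counts samples model A gets right and model B gets wrong;
    c counts the reverse. Under the null hypothesis that both models
    have the same error rate, b follows Binomial(b + c, 0.5).
    """
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical paired predictions on ten test reviews (1 = positive, 0 = negative).
y_true = [1] * 10
pred_a = [1] * 10            # model A is right on all ten
pred_b = [0] * 8 + [1] * 2   # model B is wrong on the first eight
print(mcnemar_exact(y_true, pred_a, pred_b))
```

Only the discordant pairs enter the test, so two models with similar aggregate scores can still differ significantly if their errors fall on different reviews.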
Circularity Check
No circularity: standard empirical ML benchmarking on scraped corpus
The paper reports results from training and evaluating off-the-shelf classifiers (SVM, Multinomial Naive Bayes, Decision Tree, BiLSTM) on a fixed scraped-and-cleaned dataset of Indonesian Spotify reviews. All steps are standard preprocessing (slang normalization, stopword removal, stemming) followed by direct performance measurement via weighted F1 and per-class metrics. No equations, derivations, or parameter-fitting procedures are present that could reduce a claimed prediction to its own inputs by construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The central comparison (BiLSTM highest weighted F1 but weak on neutral class; ML+SMOTE more balanced) is a straightforward empirical outcome on the given corpus and does not rely on any self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Scraped online reviews form an independent and identically distributed sample suitable for supervised classification after standard cleaning.
Reference graph
Works this paper leans on
- [1] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008. doi:10.1561/1500000011
- [2] Ananya Pratap and Nidhi. Comparative sentiment analysis of app reviews. arXiv preprint arXiv:2006.09739, 2020. URL https://arxiv.org/abs/2006.09739
- [3] Furqan Rustam, Imran Ashraf, Arif Mehmood, Saleem Ullah, et al. A comparative study of sentiment analysis using NLP and different machine learning techniques on US airline Twitter data. arXiv preprint arXiv:2110.00859, 2021. URL https://arxiv.org/abs/2110.00859
- [4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi:10.1162/neco.1997.9.8.1735
- [5] Guixian Xu, Yuxin Meng, Xuxiang Qiu, Zhenlong Yu, and Xueliang Wu. Sentiment analysis of comment texts based on BiLSTM. IEEE Access, 7:51522–51532, 2019. doi:10.1109/ACCESS.2019.2911964
- [6] Chih-Hsueh Lin and Ulin Nuha. Sentiment analysis of Indonesian datasets based on a hybrid deep-learning strategy. Journal of Big Data, 10(1):88, 2023. doi:10.1186/s40537-023-00782-9
- [7] Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for... 2020
- [8] Setyo Kuncahyo et al. Improving Bi-LSTM performance for Indonesian sentiment analysis using paragraph vector. arXiv preprint arXiv:2009.05720, 2020. URL https://arxiv.org/abs/2009.05720
- [9] Muhammad Farizki et al. Klasifikasi sentimen menggunakan bidirectional LSTM dan IndoBERT dengan dataset terbatas. ZONAsi: Jurnal Sistem Informasi, 2025. URL https://journal.unilak.ac.id/index.php/zn/article/view/25091
- [10] Oluwaseun Abiola et al. Sentiment classification on the 2024 Indonesian presidential candidate dataset using deep learning approaches. Indonesian Journal of Statistics and Its Applications, 2024. URL https://journal-stats.ipb.ac.id/index.php/ijsa/article/view/1259
- [11] Mirna Adriani, Jelita Asian, Bobby Nazief, Saied M. M. Tahaghoghi, and Hugh E. Williams. Stemming Indonesian. ACM Transactions on Asian Language Information Processing, 6(4):1–33, 2007. doi:10.1145/1316457.1316458
- [12] Chukwuemeka Ivan Ossai and Nilmini Wickramasinghe. Sentiment analysis on Google Play Store app users’ reviews based on deep learning approach. Multimedia Tools and Applications, 2024. doi:10.1007/s11042-024-19185-w
- [13] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. doi:10.1613/jair.953