Sentiment Analysis of Mobile Legends App Reviews Using Machine Learning and LSTM-Based Deep Learning Models
Pith reviewed 2026-05-09 14:51 UTC · model grok-4.3
The pith
An LSTM model reaches 92 percent accuracy classifying positive, negative, and neutral sentiments in Mobile Legends app reviews, beating traditional machine learning baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The LSTM model outperforms the classical Machine Learning baselines, achieving 92% accuracy and a weighted F1-score of 91%. The findings indicate that deep learning is more effective for handling informal and context-dependent user review text.
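The two headline numbers measure different things, and a short sketch makes the gap concrete. The code below is an illustration on invented toy labels, not the paper's data or evaluation code; the function names `accuracy`, `per_class_f1`, and `weighted_f1` are hypothetical. It shows that accuracy and weighted F1 can both look healthy while one class is never predicted correctly.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true."""
    scores = {}
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[c] * support[c] for c in support) / len(y_true)

# Toy, imbalanced labels (assumed for illustration only).
y_true = ["pos"] * 6 + ["neg"] * 3 + ["neu"] * 1
y_pred = ["pos"] * 6 + ["neg"] * 3 + ["pos"]  # the single neutral review is missed

acc = accuracy(y_true, y_pred)
f1_by_class = per_class_f1(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

On this toy set the accuracy is 0.9 and the weighted F1 is about 0.85, yet the F1 for the neutral class is 0. Aggregate scores of this kind therefore say little about minority-class behavior unless per-class figures accompany them.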
What carries the argument
An LSTM network that models the sequential dependencies between words in informal review text
If this is right
- LSTM models better preserve context when classifying short, slang-filled app reviews than TF-IDF vectorized baselines.
- Deep learning yields measurably higher accuracy and F1 scores on three-class sentiment tasks from user-generated text.
- App teams can obtain more trustworthy signals about player satisfaction by applying sequence-aware models to store reviews.
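The first implication rests on the fact that a TF-IDF vector records which words occur, not their order. The sketch below is a minimal pure-Python TF-IDF under assumed choices (whitespace tokenization, a smoothed IDF term); it is not the paper's feature pipeline, and the example reviews are invented. It shows two reviews with opposite meaning receiving identical vectors.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf times smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

# Invented reviews: the first two use the same words in a different order.
reviews = [
    "not good just bad",
    "not bad just good",
    "great hero balance",
]
vecs = tfidf_vectors(reviews)

same_vector = vecs[0] == vecs[1]                      # True: order is lost
same_sequence = reviews[0].split() == reviews[1].split()  # False: order differs
```

A sequence model receives the tokens in order and so can, in principle, tell these two reviews apart. Whether that accounts for the reported gap is a separate empirical question.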
Where Pith is reading between the lines
- More reliable sentiment labels could let developers spot emerging complaints about balance or bugs earlier in a game's lifecycle.
- The same LSTM approach may transfer to reviews of other mobile games or apps whose users write in casual, context-heavy language.
- If labeling was done without quality checks, the reported performance gap could shrink on cleaner or differently balanced data.
Load-bearing premise
The 10,000 reviews carry accurate labels that reflect actual user sentiments and are representative of the full population of reviews.
What would settle it
Label a fresh set of 1,000 Mobile Legends reviews by multiple independent annotators, then measure whether the LSTM still exceeds 85 percent accuracy on that held-out set.
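One way to run such a check is sketched below, under stated assumptions: three annotators per review, a strict-majority rule for the gold label, and reviews without a majority dropped from scoring. The helper names and the toy annotations are hypothetical, and the 0.85 threshold is the one named above.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def heldout_accuracy(annotations, predictions):
    """Accuracy over the reviews on which annotators reached a strict majority.

    Returns (accuracy, number_of_reviews_scored).
    """
    pairs = [(majority_label(v), p) for v, p in zip(annotations, predictions)]
    kept = [(gold, pred) for gold, pred in pairs if gold is not None]
    correct = sum(gold == pred for gold, pred in kept)
    return correct / len(kept), len(kept)

# Toy annotations from three hypothetical annotators per review.
annotations = [
    ("pos", "pos", "pos"),
    ("neg", "neg", "neu"),
    ("neu", "pos", "neg"),  # no majority, so this review is not scored
    ("neu", "neu", "neu"),
    ("neg", "neg", "neg"),
]
predictions = ["pos", "neg", "pos", "pos", "neg"]  # stand-in model outputs

acc, n_scored = heldout_accuracy(annotations, predictions)
passes_threshold = acc > 0.85
```

Here four reviews are scored and three predictions match, so the accuracy is 0.75 and the check fails. Dropping no-majority reviews is a design choice: it removes the most ambiguous items, which tends to flatter the model, so the number dropped should be reported alongside the accuracy.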
Original abstract
This paper compares Machine Learning and LSTM-based Deep Learning methods for sentiment analysis of Mobile Legends app reviews. Using a dataset of 10,000 reviews labeled as positive, negative, and neutral, the study evaluates traditional models with TF-IDF and PyCaret AutoML and compares them against an LSTM model designed to capture sequential text dependencies. The results show that the LSTM model outperforms the classical Machine Learning baselines, achieving 92% accuracy and a weighted F1-score of 91%. The findings indicate that deep learning is more effective for handling informal and context-dependent user review text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares traditional machine learning models (TF-IDF features combined with PyCaret AutoML) against an LSTM-based deep learning model for three-class sentiment analysis (positive/negative/neutral) on a dataset of 10,000 Mobile Legends app reviews. It reports that the LSTM achieves 92% accuracy and 91% weighted F1-score, outperforming the baselines, and concludes that deep learning is more effective for informal, context-dependent review text.
Significance. If the experimental protocol were fully documented and reproducible, the work would offer a useful data point on the relative merits of sequential models versus bag-of-words AutoML pipelines for short, noisy user-generated text in the mobile-gaming domain. The inclusion of an AutoML baseline is a modest strength that reduces the risk of under-tuned classical models. However, the absence of any description of label provenance, data partitioning, or statistical validation means the headline performance numbers cannot currently be interpreted as evidence of model quality.
major comments (3)
- [Methodology] Methodology / Data section: the manuscript supplies no information whatsoever on how the 10,000 reviews were labeled positive/negative/neutral (star-rating mapping, manual annotation, heuristics, or otherwise), nor any class-balance statistics, inter-annotator agreement, or quality-control steps. Because the central claim rests on the LSTM reaching 92% accuracy against these labels, the lack of label provenance renders the accuracy figure uninterpretable.
- [Experimental Setup] Experimental Setup / Results: no train-test split ratio, cross-validation procedure, hyperparameter search details for the LSTM (layers, hidden size, dropout, optimizer, epochs), or PyCaret configuration are reported. Without these, it is impossible to determine whether the reported 1–2 point gains over baselines reflect genuine modeling superiority or simply differences in tuning effort.
- [Results] Results section: performance is reported only as aggregate accuracy and weighted F1; no confusion matrices, per-class F1 scores, statistical significance tests on the differences, or error analysis (e.g., failure modes on neutral or sarcastic reviews) are provided. This prevents assessment of whether the LSTM’s advantage is concentrated where sequential modeling should help.
minor comments (2)
- [Introduction] The abstract and introduction cite no prior work on LSTM- or transformer-based sentiment analysis of app reviews, which makes it difficult to situate the contribution relative to existing literature.
- [Results] Figure captions and axis labels in the results plots are insufficiently descriptive; readers cannot tell which curve corresponds to which model without consulting the text.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We agree that the manuscript requires substantial revisions to improve the documentation of the methodology, experimental setup, and results analysis to ensure reproducibility and to allow for a more nuanced evaluation of the models' performance.
Point-by-point responses
Referee: [Methodology] Methodology / Data section: the manuscript supplies no information whatsoever on how the 10,000 reviews were labeled positive/negative/neutral (star-rating mapping, manual annotation, heuristics, or otherwise), nor any class-balance statistics, inter-annotator agreement, or quality-control steps. Because the central claim rests on the LSTM reaching 92% accuracy against these labels, the lack of label provenance renders the accuracy figure uninterpretable.
Authors: We agree that this is a critical omission. The revised manuscript will include a detailed description of the data labeling process in the Methodology section. Specifically, we will explain the labeling methodology used, provide class balance statistics, and describe any quality control steps. This will make the reported accuracy interpretable. revision: yes
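The two statistics promised here are simple to compute. The sketch below assumes two annotators labeled the same reviews and uses Cohen's kappa as the agreement measure; the manuscript does not say which measure, if any, was used, and the labels shown are invented.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: share of all reviews} for a list of labels."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}

def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement if each annotator labeled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (observed - expected) / (1 - expected)

# Invented labels from two hypothetical annotators for six reviews.
annotator_a = ["pos", "pos", "neg", "neu", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "neu", "pos", "pos"]

balance = class_balance(annotator_a)
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on four of six reviews (about 67 percent raw agreement), but kappa is only about 0.45 once chance agreement is removed. A reported model accuracy is hard to read without knowing how much the human labels themselves agree.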
Referee: [Experimental Setup] Experimental Setup / Results: no train-test split ratio, cross-validation procedure, hyperparameter search details for the LSTM (layers, hidden size, dropout, optimizer, epochs), or PyCaret configuration are reported. Without these, it is impossible to determine whether the reported 1–2 point gains over baselines reflect genuine modeling superiority or simply differences in tuning effort.
Authors: We acknowledge the need for full experimental details. In the revised paper, we will add the train-test split ratio, cross-validation procedure, and complete hyperparameter specifications for the LSTM model as well as the configuration used in PyCaret. This will allow readers to reproduce the experiments and assess the fairness of the comparison. revision: yes
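As an illustration of what a reproducible split specification could look like, the sketch below performs a seeded, stratified train-test split in plain Python. The 80/20 ratio, the seed, and the class counts are assumptions chosen for the example; the manuscript reports none of them.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class held out at the same rate."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    train_idx, test_idx = [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Invented, imbalanced label list standing in for the review dataset.
labels = ["pos"] * 60 + ["neg"] * 30 + ["neu"] * 10
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```

Reporting the ratio, the seed, and the per-class counts in each partition lets a reader confirm that both model families were trained and scored on the same reviews.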
Referee: [Results] Results section: performance is reported only as aggregate accuracy and weighted F1; no confusion matrices, per-class F1 scores, statistical significance tests on the differences, or error analysis (e.g., failure modes on neutral or sarcastic reviews) are provided. This prevents assessment of whether the LSTM’s advantage is concentrated where sequential modeling should help.
Authors: We agree that more detailed results reporting is warranted. The revised Results section will include confusion matrices, per-class F1 scores, statistical significance tests for the performance differences, and an error analysis focusing on the model's behavior on neutral and context-dependent reviews. This will better demonstrate where the LSTM provides advantages. revision: yes
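A minimal sketch of two of the promised analyses follows: a confusion matrix and a paired significance test. It assumes both models are scored on the same test reviews and uses an exact McNemar test, which is one common choice for this setting and not necessarily the one the authors will adopt. All labels and predictions are invented.

```python
from collections import Counter
from math import comb

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models on the same test items.

    Only discordant items (exactly one model correct) carry information.
    """
    a_only = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    b_only = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

labels = ["pos", "neg", "neu"]
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
lstm_pred = ["pos", "pos", "neg", "neg", "neu", "pos"]      # stand-in outputs
baseline_pred = ["pos", "neg", "neg", "pos", "pos", "pos"]  # stand-in outputs

lstm_matrix = confusion_matrix(y_true, lstm_pred, labels)
p_value = mcnemar_exact(y_true, lstm_pred, baseline_pred)
```

In this toy example the stand-in LSTM is correct on three reviews that the baseline gets wrong and never the reverse, yet the p-value is 0.25 because three discordant items are too few to rule out chance. The same logic applies to small accuracy gaps on a modest test set.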
Circularity Check
No circularity: direct empirical model comparison on fixed labeled data
The paper conducts an empirical benchmark: it collects 10,000 app reviews, applies TF-IDF + PyCaret classical ML baselines, trains an LSTM, and reports measured accuracy/F1 numbers. No derivation chain exists, no quantities are fitted then re-presented as independent predictions, and no self-citations or uniqueness theorems are invoked to justify the architecture or results. The reported 92% accuracy is simply the observed performance on the given split; it does not reduce to the inputs by construction. Labeling details are absent, but that is a data-validity issue, not a circularity in any claimed derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- LSTM hyperparameters and training settings
axioms (1)
- domain assumption: Reviews can be reliably labeled as positive, negative, or neutral by human or automated means.