Sentiment Analysis of Mobile Legends App Reviews Using Machine Learning and LSTM-Based Deep Learning Models

Ardika Satria; Daris Samudra; Kharisa Harvanny; Luluk Muthoharoh; Martin Clinton Tosima Manullang; Vira Putri Maharani

arxiv: 2605.01317 · v1 · submitted 2026-05-02 · 💻 cs.CL

Sentiment Analysis of Mobile Legends App Reviews Using Machine Learning and LSTM-Based Deep Learning Models

Vira Putri Maharani , Kharisa Harvanny , Daris Samudra , Luluk Muthoharoh , Ardika Satria , Martin Clinton Tosima Manullang This is my paper

Pith reviewed 2026-05-09 14:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords sentiment analysisLSTMdeep learningmachine learningMobile Legendsapp reviewstext classification

0 comments

The pith

An LSTM model reaches 92 percent accuracy classifying positive, negative, and neutral sentiments in Mobile Legends app reviews, beating traditional machine learning baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether deep learning can improve sentiment analysis on 10,000 labeled user reviews of a popular mobile game. It trains classical models with TF-IDF features and AutoML, then compares them to an LSTM network built to track word order and context in short, informal texts. The LSTM records 92 percent accuracy and 91 percent weighted F1-score. A sympathetic reader would care because app developers rely on review sentiment to decide which features to fix or promote, and better classification could reduce misreading of player feedback.

Core claim

The LSTM model outperforms the classical Machine Learning baselines, achieving 92% accuracy and a weighted F1-score of 91%. The findings indicate that deep learning is more effective for handling informal and context-dependent user review text.

What carries the argument

LSTM network that processes sequential text dependencies in informal reviews

If this is right

LSTM models better preserve context when classifying short, slang-filled app reviews than TF-IDF vectorized baselines.
Deep learning yields measurably higher accuracy and F1 scores on three-class sentiment tasks from user-generated text.
App teams can obtain more trustworthy signals about player satisfaction by applying sequence-aware models to store reviews.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

More reliable sentiment labels could let developers spot emerging complaints about balance or bugs earlier in a game's lifecycle.
The same LSTM approach may transfer to reviews of other mobile games or apps whose users write in casual, context-heavy language.
If labeling was done without quality checks, the reported performance gap could shrink on cleaner or differently balanced data.

Load-bearing premise

The 10,000 reviews carry accurate labels that reflect actual user sentiments and are representative of the full population of reviews.

What would settle it

Label a fresh set of 1,000 Mobile Legends reviews by multiple independent annotators, then measure whether the LSTM still exceeds 85 percent accuracy on that held-out set.

Figures

Figures reproduced from arXiv: 2605.01317 by Ardika Satria, Daris Samudra, Kharisa Harvanny, Luluk Muthoharoh, Martin Clinton Tosima Manullang, Vira Putri Maharani.

**Figure 1.** Figure 1: Compact visualization of the evaluated models and LSTM behaviour. view at source ↗

read the original abstract

This paper compares Machine Learning and LSTM-based Deep Learning methods for sentiment analysis of Mobile Legends app reviews. Using a dataset of 10,000 reviews labeled as positive, negative, and neutral, the study evaluates traditional models with TF-IDF and PyCaret AutoML and compares them against an LSTM model designed to capture sequential text dependencies. The results show that the LSTM model outperforms the classical Machine Learning baselines, achieving 92% accuracy and a weighted F1-score of 91%. The findings indicate that deep learning is more effective for handling informal and context-dependent user review text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a routine LSTM vs TF-IDF comparison on 10k game reviews that reports 92% accuracy but supplies no details on how the labels were produced.

read the letter

The main takeaway is that this paper applies a standard LSTM to sentiment analysis of Mobile Legends reviews and claims it beats classical baselines at 92% accuracy and 91% weighted F1, yet the results rest on an unexamined foundation. The work itself is a straightforward empirical comparison using TF-IDF features with PyCaret for the machine learning side and a basic LSTM for the deep learning side. It shows the expected pattern that the sequential model handles informal review text better than bag-of-words approaches. That part is executed cleanly enough for what it is, and the numbers are presented without obvious calculation errors. For someone who needs a quick practical benchmark on this narrow domain, the paper supplies usable reference points. Beyond that, there is little new. The techniques are established, the dataset is one more collection of app reviews, and no architectural changes or theoretical points are offered. The soft spots sit right at the center. The abstract and description give no account of how the 10,000 reviews were labeled positive, negative, or neutral. There is no mention of star-rating mapping, manual annotation, inter-annotator checks, or class balance. Without that information the accuracy figure cannot be read as evidence of model quality; the LSTM could simply be learning whatever regularities exist in the label assignment process. Additional gaps include the absence of train-test split details, hyperparameter selection, and any error analysis. These omissions make it hard to judge whether the reported improvement is robust or reproducible. This kind of paper mainly serves students or app developers who want an example run on game feedback data. It does not advance NLP methods or resolve open questions, so it would not be useful for a reading group focused on research contributions. I would not send it to peer review. The methodological gaps are too large and the novelty too low for referees to invest time in it.

Referee Report

3 major / 2 minor

Summary. The manuscript compares traditional machine learning models (TF-IDF features combined with PyCaret AutoML) against an LSTM-based deep learning model for three-class sentiment analysis (positive/negative/neutral) on a dataset of 10,000 Mobile Legends app reviews. It reports that the LSTM achieves 92% accuracy and 91% weighted F1-score, outperforming the baselines, and concludes that deep learning is more effective for informal, context-dependent review text.

Significance. If the experimental protocol were fully documented and reproducible, the work would offer a useful data point on the relative merits of sequential models versus bag-of-words AutoML pipelines for short, noisy user-generated text in the mobile-gaming domain. The inclusion of an AutoML baseline is a modest strength that reduces the risk of under-tuned classical models. However, the absence of any description of label provenance, data partitioning, or statistical validation means the headline performance numbers cannot currently be interpreted as evidence of model quality.

major comments (3)

[Methodology] Methodology / Data section: the manuscript supplies no information whatsoever on how the 10,000 reviews were labeled positive/negative/neutral (star-rating mapping, manual annotation, heuristics, or otherwise), nor any class-balance statistics, inter-annotator agreement, or quality-control steps. Because the central claim rests on the LSTM reaching 92% accuracy against these labels, the lack of label provenance renders the accuracy figure uninterpretable.
[Experimental Setup] Experimental Setup / Results: no train-test split ratio, cross-validation procedure, hyperparameter search details for the LSTM (layers, hidden size, dropout, optimizer, epochs), or PyCaret configuration are reported. Without these, it is impossible to determine whether the reported 1–2 point gains over baselines reflect genuine modeling superiority or simply differences in tuning effort.
[Results] Results section: performance is reported only as aggregate accuracy and weighted F1; no confusion matrices, per-class F1 scores, statistical significance tests on the differences, or error analysis (e.g., failure modes on neutral or sarcastic reviews) are provided. This prevents assessment of whether the LSTM’s advantage is concentrated where sequential modeling should help.

minor comments (2)

[Introduction] The abstract and introduction cite no prior work on LSTM or transformer-based sentiment analysis of app reviews, making it difficult to situate the contribution relative to existing literature.
[Results] Figure captions and axis labels in the results plots are insufficiently descriptive; readers cannot tell which curve corresponds to which model without consulting the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We agree that the manuscript requires substantial revisions to improve the documentation of the methodology, experimental setup, and results analysis to ensure reproducibility and to allow for a more nuanced evaluation of the models' performance.

read point-by-point responses

Referee: [Methodology] Methodology / Data section: the manuscript supplies no information whatsoever on how the 10,000 reviews were labeled positive/negative/neutral (star-rating mapping, manual annotation, heuristics, or otherwise), nor any class-balance statistics, inter-annotator agreement, or quality-control steps. Because the central claim rests on the LSTM reaching 92% accuracy against these labels, the lack of label provenance renders the accuracy figure uninterpretable.

Authors: We agree that this is a critical omission. The revised manuscript will include a detailed description of the data labeling process in the Methodology section. Specifically, we will explain the labeling methodology used, provide class balance statistics, and describe any quality control steps. This will make the reported accuracy interpretable. revision: yes
Referee: [Experimental Setup] Experimental Setup / Results: no train-test split ratio, cross-validation procedure, hyperparameter search details for the LSTM (layers, hidden size, dropout, optimizer, epochs), or PyCaret configuration are reported. Without these, it is impossible to determine whether the reported 1–2 point gains over baselines reflect genuine modeling superiority or simply differences in tuning effort.

Authors: We acknowledge the need for full experimental details. In the revised paper, we will add the train-test split ratio, cross-validation procedure, and complete hyperparameter specifications for the LSTM model as well as the configuration used in PyCaret. This will allow readers to reproduce the experiments and assess the fairness of the comparison. revision: yes
Referee: [Results] Results section: performance is reported only as aggregate accuracy and weighted F1; no confusion matrices, per-class F1 scores, statistical significance tests on the differences, or error analysis (e.g., failure modes on neutral or sarcastic reviews) are provided. This prevents assessment of whether the LSTM’s advantage is concentrated where sequential modeling should help.

Authors: We agree that more detailed results reporting is warranted. The revised Results section will include confusion matrices, per-class F1 scores, statistical significance tests for the performance differences, and an error analysis focusing on the model's behavior on neutral and context-dependent reviews. This will better demonstrate where the LSTM provides advantages. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical model comparison on fixed labeled data

full rationale

The paper conducts an empirical benchmark: it collects 10,000 app reviews, applies TF-IDF + PyCaret classical ML baselines, trains an LSTM, and reports measured accuracy/F1 numbers. No derivation chain exists, no quantities are fitted then re-presented as independent predictions, and no self-citations or uniqueness theorems are invoked to justify the architecture or results. The reported 92% accuracy is simply the observed performance on the given split; it does not reduce to the inputs by construction. Labeling details are absent, but that is a data-validity issue, not a circularity in any claimed derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger reflects unstated but necessary elements for any supervised sentiment task; full paper would likely reveal additional fitted hyperparameters and preprocessing choices.

free parameters (1)

LSTM hyperparameters and training settings
Typical values such as hidden size, learning rate, and epochs are optimized on the data but not reported.

axioms (1)

domain assumption Reviews can be reliably labeled as positive, negative, or neutral by human or automated means
Core assumption of supervised classification; no labeling protocol is described.

pith-pipeline@v0.9.0 · 5415 in / 1309 out tokens · 57361 ms · 2026-05-09T14:51:00.778703+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

Setiawan, B. (2025). A review of sentiment analysis applications in Indonesia between 2023–2024.Journal of Information Engineering and Educational Technology, 8(2), 71–83. doi:10.26740/jieet.v8n2.p71-83

work page doi:10.26740/jieet.v8n2.p71-83 2025
[2]

B., Andriansyah, C., Sensuse, D

Hamid, R. B., Andriansyah, C., Sensuse, D. I., Lusa, S., Elisabeth, D., & Safitri, N. (2025). Sentiment analysis and topic modeling for discovering knowledge in Indonesian mobile government applications.Jurnal Teknik Informatika (JUTIF), 6(5), 3188–3203. doi:10.52436/1.jutif.2025.6.6.4991

work page doi:10.52436/1.jutif.2025.6.6.4991 2025
[3]

N., & Suryono, R

Setyani, T., Sari, K., Heidy, H. N., & Suryono, R. R. (2026). Analisis sentimen pengguna aplikasi Jamsostek Mobile berdasarkan ulasan Google Play Store menggunakan algoritma support vector machine dan Naive Bayes.MALCOM: Indonesian Journal of Machine Learning and Computer Science, 6(1), 373–384. doi:10.57152/malcom.v6i1.2526

work page doi:10.57152/malcom.v6i1.2526 2026
[4]

S., Hermanto, T

Ningsih, T. S., Hermanto, T. I., & Nugroho, I. M. (2024). Sentiment analysis of mobile provider application reviews using Na ¨ıve Bayes and support vector machine. Sinkron: Jurnal dan Penelitian Teknik Informatika, 8(2). doi:10.33395/sinkron.v8i2.13469

work page doi:10.33395/sinkron.v8i2.13469 2024
[5]

Tarwoto, T., Nugroho, R., Azka, N., & Graha, W. S. R. (2025). Analisis sentimen ulasan aplikasi Mobile JKN di Google Play Store menggunakan IndoBERT.Jurnal Teknologi Informasi dan Komunikasi, 9(2), 495–505. doi:10.35870/jtik.v9i2.3340

work page doi:10.35870/jtik.v9i2.3340 2025
[6]

S., Christian, E., & Lestari, A

Pangestu, A. S., Christian, E., & Lestari, A. (2025). Aspect-based sentiment analysis pada ulasan pengguna aplikasi Mobile JKN menggunakan model berbasis transformer.Journal of Information Technology and Computer Science, 5(4). doi:10.47111/jointecoms.v5i4.25348

work page doi:10.47111/jointecoms.v5i4.25348 2025
[7]

Putra, D. N. A. (2024). Sentiment analysis of national health security mobile application review using machine learning.Jurnal Jaminan Kesehatan Nasional, 4(2). doi:10.53756/jjkn.v4i2.269

work page doi:10.53756/jjkn.v4i2.269 2024
[8]

Fikri, A. A. Z., & Ridho, H. (2025). Identification of inconsistent reviews and ratings on apps using sentiment analysis: Case study on Indonesian digital media platform. Metris: Jurnal Sains dan Teknologi, 26(1). doi:10.25170/metris.v26i01.6779

work page doi:10.25170/metris.v26i01.6779 2025
[9]

M., Vitianingsih, A

Ayomi, J. M., Vitianingsih, A. V ., Kristyawan, Y ., Maukar, A. L., & Widiartin, T. (2026). Sentiment analysis of user reviews for the PLN Mobile application using Na¨ıve Bayes and long short-term memory.Journal of Information Systems and Informatics, 7(4). doi:10.63158/journalisi.v7i4.1342

work page doi:10.63158/journalisi.v7i4.1342 2026
[10]

S., Sukmadewa, A

Nugroho, K. S., Sukmadewa, A. Y ., Wuswilahaken, H. D. W., Bachtiar, F. A., & Yudistira, N. (2021). BERT fine-tuning for sentiment analysis on Indonesian mobile apps reviews.arXiv preprint.https://arxiv.org/abs/2107.06802. 8

work page arXiv 2021

[1] [1]

Setiawan, B. (2025). A review of sentiment analysis applications in Indonesia between 2023–2024.Journal of Information Engineering and Educational Technology, 8(2), 71–83. doi:10.26740/jieet.v8n2.p71-83

work page doi:10.26740/jieet.v8n2.p71-83 2025

[2] [2]

B., Andriansyah, C., Sensuse, D

Hamid, R. B., Andriansyah, C., Sensuse, D. I., Lusa, S., Elisabeth, D., & Safitri, N. (2025). Sentiment analysis and topic modeling for discovering knowledge in Indonesian mobile government applications.Jurnal Teknik Informatika (JUTIF), 6(5), 3188–3203. doi:10.52436/1.jutif.2025.6.6.4991

work page doi:10.52436/1.jutif.2025.6.6.4991 2025

[3] [3]

N., & Suryono, R

Setyani, T., Sari, K., Heidy, H. N., & Suryono, R. R. (2026). Analisis sentimen pengguna aplikasi Jamsostek Mobile berdasarkan ulasan Google Play Store menggunakan algoritma support vector machine dan Naive Bayes.MALCOM: Indonesian Journal of Machine Learning and Computer Science, 6(1), 373–384. doi:10.57152/malcom.v6i1.2526

work page doi:10.57152/malcom.v6i1.2526 2026

[4] [4]

S., Hermanto, T

Ningsih, T. S., Hermanto, T. I., & Nugroho, I. M. (2024). Sentiment analysis of mobile provider application reviews using Na ¨ıve Bayes and support vector machine. Sinkron: Jurnal dan Penelitian Teknik Informatika, 8(2). doi:10.33395/sinkron.v8i2.13469

work page doi:10.33395/sinkron.v8i2.13469 2024

[5] [5]

Tarwoto, T., Nugroho, R., Azka, N., & Graha, W. S. R. (2025). Analisis sentimen ulasan aplikasi Mobile JKN di Google Play Store menggunakan IndoBERT.Jurnal Teknologi Informasi dan Komunikasi, 9(2), 495–505. doi:10.35870/jtik.v9i2.3340

work page doi:10.35870/jtik.v9i2.3340 2025

[6] [6]

S., Christian, E., & Lestari, A

Pangestu, A. S., Christian, E., & Lestari, A. (2025). Aspect-based sentiment analysis pada ulasan pengguna aplikasi Mobile JKN menggunakan model berbasis transformer.Journal of Information Technology and Computer Science, 5(4). doi:10.47111/jointecoms.v5i4.25348

work page doi:10.47111/jointecoms.v5i4.25348 2025

[7] [7]

Putra, D. N. A. (2024). Sentiment analysis of national health security mobile application review using machine learning.Jurnal Jaminan Kesehatan Nasional, 4(2). doi:10.53756/jjkn.v4i2.269

work page doi:10.53756/jjkn.v4i2.269 2024

[8] [8]

Fikri, A. A. Z., & Ridho, H. (2025). Identification of inconsistent reviews and ratings on apps using sentiment analysis: Case study on Indonesian digital media platform. Metris: Jurnal Sains dan Teknologi, 26(1). doi:10.25170/metris.v26i01.6779

work page doi:10.25170/metris.v26i01.6779 2025

[9] [9]

M., Vitianingsih, A

Ayomi, J. M., Vitianingsih, A. V ., Kristyawan, Y ., Maukar, A. L., & Widiartin, T. (2026). Sentiment analysis of user reviews for the PLN Mobile application using Na¨ıve Bayes and long short-term memory.Journal of Information Systems and Informatics, 7(4). doi:10.63158/journalisi.v7i4.1342

work page doi:10.63158/journalisi.v7i4.1342 2026

[10] [10]

S., Sukmadewa, A

Nugroho, K. S., Sukmadewa, A. Y ., Wuswilahaken, H. D. W., Bachtiar, F. A., & Yudistira, N. (2021). BERT fine-tuning for sentiment analysis on Indonesian mobile apps reviews.arXiv preprint.https://arxiv.org/abs/2107.06802. 8

work page arXiv 2021