A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews
Pith reviewed 2026-05-15 09:18 UTC · model grok-4.3
The pith
Classical models like Random Forest outperform fine-tuned transformers for sentiment analysis of English-Bangla government banking app reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that traditional models such as Random Forest (0.815 accuracy) and Linear SVM (0.804 weighted F1) outperform a fine-tuned XLM-RoBERTa (0.793 accuracy) on bilingual sentiment classification of app reviews. McNemar's test confirms that the classical models are significantly better than the off-the-shelf transformer, though their differences with the fine-tuned variant are not statistically significant. Aspect analysis with DeBERTa-v3 shows primary dissatisfaction with transaction speed and interface design, and the eJanata app receives the lowest ratings. A 16.1-percentage-point performance gap between Bangla and English text is noted.
What carries the argument
The hybrid labeling method that combines reviewer star ratings with an independent XLM-RoBERTa classifier to generate ground-truth labels for model training and evaluation.
If this is right
- Random Forest achieves the highest accuracy of 0.815 on the combined English-Bangla dataset.
- Linear SVM achieves the highest weighted F1 score of 0.804.
- Aspect-level sentiment shows the strongest user dissatisfaction with transaction speed and app interface design.
- The eJanata app receives the worst ratings across all four government apps studied.
- A 16.1 percentage point accuracy gap between Bangla and English text indicates specific challenges for low-resource language processing.
Where Pith is reading between the lines
- Classical models may offer a more practical and lower-cost option than transformer fine-tuning for similar sentiment tasks in other low-resource languages.
- Prioritizing fixes for transaction speed and interface issues could improve user trust in government digital banking services.
- The performance gap suggests targeted development of Bangla-specific resources could narrow differences in future models.
- Similar multi-model comparisons could be applied to reviews of other public service apps in comparable economies.
Load-bearing premise
The hybrid labeling method that combines star ratings with an independent XLM-RoBERTa classifier produces sufficiently reliable ground-truth labels despite only moderate inter-method agreement.
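The moderate agreement figure at the heart of this premise is a Cohen's kappa. As a minimal illustration (toy labels, not the paper's data) of how the inter-method agreement between star-rating labels and classifier labels is measured:

```python
# Illustrative sketch, not the authors' code: measuring agreement between the
# two labeling sources with Cohen's kappa. Label values here are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Sentiment labels derived from star ratings (e.g. 1-2 stars -> "neg",
# 3 -> "neu", 4-5 -> "pos") for a handful of reviews.
star_labels = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg"]

# Labels predicted for the same reviews by an independent classifier
# (XLM-RoBERTa in the paper).
clf_labels = ["pos", "neg", "pos", "pos", "neu", "pos", "neu", "neg"]

# Kappa corrects raw agreement for chance agreement implied by the marginals.
kappa = cohen_kappa_score(star_labels, clf_labels)
print(round(kappa, 3))  # prints 0.619 for this toy sample
```

On this toy sample kappa is higher than the paper's reported 0.459; values near 0.4-0.6 are conventionally read as "moderate" agreement, which is why the premise hinges on the labels being usable despite imperfect alignment.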
What would settle it
A replication study that replaces the hybrid labels with fully independent human annotations and finds that fine-tuned transformers then match or exceed the accuracy of Random Forest and Linear SVM would falsify the performance superiority claim.
Figures
read the original abstract
For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined each reviewer's star rating with a separate, independent XLM-RoBERTa classifier, yielding moderate inter-method agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both exceeded the fine-tuned XLM-RoBERTa (0.793). McNemar's test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p < 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied to aspect-level sentiment analysis of the reviews for the four apps; reviewers expressed dissatisfaction primarily with transaction speed and poor interface design, and the eJanata app received the worst ratings across all apps. Three policy recommendations are made based on these findings - remediation of app quality, trust-centred release management, and Bangla-first NLP adoption - to help state-owned banks improve their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 5,652 English and Bangla Google Play reviews for four Bangladeshi government mobile banking apps. It employs a hybrid labeling scheme that merges star ratings with predictions from an independent XLM-RoBERTa classifier (κ=0.459), then compares classical models (Random Forest at 0.815 accuracy, Linear SVM at 0.804 weighted F1) against fine-tuned XLM-RoBERTa (0.793 accuracy) and an off-the-shelf baseline. McNemar’s tests show classical models significantly outperform the off-the-shelf transformer but not the fine-tuned variant. Aspect-level sentiment via DeBERTa-v3 identifies transaction speed and interface design as primary pain points, with eJanata rated worst; three policy recommendations follow.
Significance. If the performance ordering is robust, the work supplies concrete evidence that classical models can remain competitive for low-resource bilingual sentiment tasks in a high-stakes public-service domain. The aspect-level findings and policy suggestions offer direct utility for app developers and regulators in developing economies. The reported 16.1-point English–Bangla gap also underscores the need for targeted low-resource model development.
major comments (1)
- [Methods (hybrid labeling)] The ground-truth labels are constructed by combining star ratings with an independent XLM-RoBERTa classifier (κ=0.459). Because the evaluation set therefore contains predictions from a model architecturally close to the fine-tuned XLM-RoBERTa whose accuracy is reported as 0.793, any systematic alignment between the labeler and the fine-tuned model can inflate the transformer’s apparent performance relative to Random Forest and Linear SVM. No ablation that recomputes all metrics on star-rating labels alone is presented, leaving the central claim that classical models outperform transformers vulnerable to label-construction bias.
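The requested ablation and significance check are mechanical to run. A minimal sketch, with toy arrays standing in for the paper's predictions and an exact binomial form of McNemar's test on the discordant pairs:

```python
# Hedged sketch of the ablation the report asks for: score both models against
# star-rating labels alone, then run McNemar's test on the paired predictions.
# All arrays are hypothetical stand-ins, not the paper's data.
import numpy as np
from scipy.stats import binomtest

star_only_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # human-derived ground truth
preds_rf = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])          # Random Forest predictions
preds_xlm = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])         # fine-tuned transformer predictions

# Discordant pairs: items where exactly one model is correct.
rf_right = preds_rf == star_only_labels
xlm_right = preds_xlm == star_only_labels
b = int(np.sum(rf_right & ~xlm_right))   # RF correct, transformer wrong
c = int(np.sum(~rf_right & xlm_right))   # transformer correct, RF wrong

# Exact McNemar test: under H0 the discordant outcomes follow Binomial(b+c, 0.5).
p_value = binomtest(b, b + c, 0.5).pvalue
print(b, c, round(p_value, 3))
```

Reporting b, c, and the exact p-value for each model pair (as this sketch does for one pair) would also address the minor comment below about missing p-values.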
minor comments (2)
- [Abstract] The 16.1-percentage-point accuracy gap between Bangla and English is stated without indicating which model(s) or evaluation split produced the figure.
- [Results] McNemar’s test p-values are given only for the off-the-shelf baseline; the exact p-values for the fine-tuned XLM-RoBERTa comparisons should be reported so readers can assess the “not statistically significant” claim.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The concern about potential bias in our hybrid labeling scheme is well-taken, and we address it directly below. We are committed to strengthening the manuscript through the requested revisions.
read point-by-point responses
Referee: [Methods (hybrid labeling)] The ground-truth labels are constructed by combining star ratings with an independent XLM-RoBERTa classifier (κ=0.459). Because the evaluation set therefore contains predictions from a model architecturally close to the fine-tuned XLM-RoBERTa whose accuracy is reported as 0.793, any systematic alignment between the labeler and the fine-tuned model can inflate the transformer’s apparent performance relative to Random Forest and Linear SVM. No ablation that recomputes all metrics on star-rating labels alone is presented, leaving the central claim that classical models outperform transformers vulnerable to label-construction bias.
Authors: We appreciate the referee's identification of this methodological vulnerability. The hybrid labels were constructed to leverage both explicit star ratings and an independent XLM-RoBERTa model (a distinct training run and data split from the fine-tuned evaluation model), with the moderate κ=0.459 reflecting only partial alignment. Nevertheless, we acknowledge that architectural similarity could introduce systematic bias favoring the transformer. In the revised manuscript we will add a full ablation recomputing accuracy, weighted F1, and McNemar tests using star-rating labels alone. This will allow readers to assess the robustness of the classical-model advantage under a purely human-derived ground truth. Revision: yes.
Circularity Check
No circularity: purely empirical comparison on held-out data with independent labeling
full rationale
The paper conducts a standard supervised learning evaluation: reviews are labeled via a hybrid process (star ratings plus a separate independent XLM-RoBERTa classifier), the labeled corpus is split, models (including fine-tuned XLM-RoBERTa) are trained on the training portion, and accuracy/F1/McNemar results are reported on the held-out test set. No equations, derivations, or self-referential definitions appear; performance numbers are direct empirical measurements rather than quantities forced by construction from fitted parameters or prior self-citations. The labeling step is external to the model training loop and does not reduce any reported metric to the inputs by definition. This is a normal non-circular empirical ML comparison.
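The pipeline the rationale describes (label, split, train, score on held-out data) can be sketched in a few lines. This is a hedged illustration with toy reviews, not the paper's corpus or hyperparameters; TF-IDF features are an assumption about the classical models' input representation:

```python
# Minimal sketch of the supervised comparison loop: TF-IDF features feed a
# classical model, and accuracy / weighted F1 are computed on a held-out split.
# The toy reviews and labels below are hypothetical stand-ins.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reviews = ["app crashes on login", "fast and easy transfer",
           "interface is confusing", "great service",
           "payment failed again", "smooth experience"] * 10
labels = ["neg", "pos", "neg", "pos", "neg", "pos"] * 10

# Held-out test set; stratify keeps class balance, mirroring standard practice.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels)

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
model.fit(X_train, y_train)
preds = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, preds))
print("weighted F1:", f1_score(y_test, preds, average="weighted"))
```

Because every metric here is measured on data the model never saw during fitting, nothing in the evaluation is true by construction, which is the substance of the no-circularity verdict.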
Reference graph
Works this paper leans on
- [1] Bangladesh Bank, “Annual Report 2022–2023,” Bangladesh Bank, Dhaka, Bangladesh, 2023. [Online]. Available: https://www.bb.org.bd/pub/annual/anreport/ar2022-2023.pdf
- [2] Ministry of Finance, “Bangladesh Economic Review 2022,” Finance Division, Government of the People’s Republic of Bangladesh, Dhaka, 2022.
- [3]
- [4] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Trans. Softw. Eng., vol. 43, no. 9, pp. 817–847, Sep. 2017.
- [5] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2020, pp. 8440–8451.
- [6] K. I. Islam, S. Kar, Md S. Islam, and M. R. Amin, “SentNoB: A dataset for analysing sentiment on noisy Bangla texts,” in Findings Assoc. Comput. Linguistics: EMNLP 2021, pp. 3265–3271.
- [7] O. Sen, M. Fuad, Md N. Islam, J. Rabbi, M. Masud, Md K. Hasan, Md A. Awal, A. A. Fime, Md T. H. Fuad, D. Sikder, and Md A. R. Iftee, “Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning based methods,” IEEE Access, vol. 10, pp. 38999–39044, 2022.
- [8] F. Barbieri, L. Espinosa Anke, and J. Camacho-Collados, “XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond,” in Proc. LREC, Marseille, France, 2022, pp. 258–266.
- [9] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Comput., vol. 10, no. 7, pp. 1895–1923, 1998.
- [10] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with disentangled attention,” in Proc. 9th Int. Conf. Learning Representations (ICLR), 2021.
- [11] E. Guzman and W. Maalej, “How do users like this feature? A fine grained sentiment analysis of app reviews,” in Proc. IEEE 22nd Int. Requirements Eng. Conf. (RE), Karlskrona, Sweden, 2014, pp. 153–162.
- [12] S. Sadiq, A. Umer, S. Ullah, S. Mirjalili, V. Rupapara, and M. Nappi, “Discrepancy detection between actual user reviews and numeric ratings of Google App Store using deep learning,” Expert Syst. Appl., vol. 181, Nov. 2021, Art. no. 115111.
- [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, 2019, pp. 4171–4186.
- [14] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv:1907.11692, Jul. 2019.
- [15] A. Verma and R. Vashisth, “Sentiment analysis of Google Play and App Store reviews of Threads,” in Proc. IEEE ICRITO, 2024.
- [16] M. Kabir, O. B. Mahfuz, S. R. Raiyan, H. Mahmud, and M. K. Hasan, “BanglaBook: A large-scale Bangla dataset for sentiment analysis from book reviews,” in Findings ACL, Toronto, Canada, 2023, pp. 3521–3536.
- [17] A. Bhattacharjee et al., “BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in Bangla,” in Findings NAACL, Seattle, WA, USA, 2022, pp. 1318–1327.
- [18] Md A. Hasan, F. Alam, A. Anjum, S. Das, and A. Anjum, “BLP-2023 Task 2: Sentiment analysis,” in Proc. 1st Workshop Bangla Language Processing (BLP), EMNLP, Singapore, 2023, pp. 354–364.
- [19] F. Ullah, S. Faizullah, I. U. Khan, T. Alghamdi, T. A. Syed, A. B. Alkhodre, M. S. Ayub, and A. Karim, “Prompt-based fine-tuning with multilingual transformers for language-independent sentiment analysis,” Scientific Reports, vol. 15, Art. no. 20834, 2025.
- [20] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar, “SemEval-2014 Task 4: Aspect based sentiment analysis,” in Proc. 8th Int. Workshop Semantic Eval. (SemEval-2014), COLING, Dublin, Ireland, 2014, pp. 27–35.
- [21] T. Zhao, J. Du, Z. Xue, A. Li, and Z. Guan, “Aspect-based sentiment analysis using local context focus mechanism with DeBERTa,” in Proc. 5th Int. Conf. Data-driven Optim. Complex Syst. (DOCS), IEEE, 2023.
- [22] W. Zhang, R. He, H. Peng, L. Bing, and W. Lam, “Cross-lingual aspect-based sentiment analysis with aspect term code-switching,” in Proc. EMNLP, 2021, pp. 9220–9230.
- [23] D. Jayakody et al., “Instruct-DeBERTa: A hybrid approach for aspect-based sentiment analysis on textual reviews,” arXiv:2408.13202, Aug. 2024.
- [24] T. Mahmood, S. Naseem, R. Ashraf, M. Asif, M. Umair, and M. Shah, “Recognizing factors effecting the use of mobile banking apps through sentiment and thematic analysis on user reviews,” Neural Comput. Appl., vol. 35, pp. 23853–23870, 2023.
- [25] B. Ogunleye, T. Maswera, L. Hirsch, J. Gaudoin, and T. Brunsdon, “Comparison of topic modelling approaches in the banking context,” Applied Sciences, vol. 13, no. 2, Art. no. 797, 2023.
- [26] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “Crowdsourcing user reviews to support the evolution of mobile apps,” J. Syst. Softw., vol. 137, pp. 143–162, 2018.
- [27] S. Hassan, C. Tantithamthavorn, C.-P. Bezemer, and A. E. Hassan, “Studying the dialogue between users and developers of free apps in the Google Play Store,” Empirical Softw. Eng., vol. 23, no. 3, pp. 1275–1312, 2018.
- [28] A. A. Ryan et al., “FinTech: Deep learning-based sentiment classification of user reviews from various Bangladeshi mobile financial services,” in Computational Intelligence in Data Science (IFIP AICT), Springer, 2023, pp. 126–140.
- [29] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.