A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews
Pith reviewed 2026-05-15 09:18 UTC · model grok-4.3
The pith
Classical models like Random Forest outperform fine-tuned transformers for sentiment analysis of English-Bangla government banking app reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that traditional models such as Random Forest (0.815 accuracy) and Linear SVM (0.804 weighted F1) outperform a fine-tuned XLM-RoBERTa (0.793 accuracy) on bilingual sentiment classification of app reviews. McNemar's test confirms that the classical models are significantly better than the off-the-shelf transformer, though their differences with the fine-tuned variant are not statistically significant. Aspect analysis with DeBERTa-v3 shows primary dissatisfaction with transaction speed and interface design, and the eJanata app receives the lowest ratings. A 16.1-percentage-point performance gap between Bangla and English text is noted.
What carries the argument
The hybrid labeling method that combines reviewer star ratings with an independent XLM-RoBERTa classifier to generate ground-truth labels for model training and evaluation.
If this is right
- Random Forest achieves the highest accuracy of 0.815 on the combined English-Bangla dataset.
- Linear SVM achieves the highest weighted F1 score of 0.804.
- Aspect-level sentiment shows the strongest user dissatisfaction with transaction speed and app interface design.
- The eJanata app receives the worst ratings across all four government apps studied.
- A 16.1 percentage point accuracy gap between Bangla and English text indicates specific challenges for low-resource language processing.
Where Pith is reading between the lines
- Classical models may offer a more practical and lower-cost option than transformer fine-tuning for similar sentiment tasks in other low-resource languages.
- Prioritizing fixes for transaction speed and interface issues could improve user trust in government digital banking services.
- The performance gap suggests targeted development of Bangla-specific resources could narrow differences in future models.
- Similar multi-model comparisons could be applied to reviews of other public service apps in comparable economies.
Load-bearing premise
The hybrid labeling method that combines star ratings with an independent XLM-RoBERTa classifier produces sufficiently reliable ground-truth labels despite only moderate inter-method agreement.
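The moderate agreement figure at the heart of this premise is a Cohen's kappa. As a minimal illustration (toy labels, not the paper's data) of how the inter-method agreement between star-rating labels and classifier labels is measured:

```python
# Illustrative sketch, not the authors' code: measuring agreement between the
# two labeling sources with Cohen's kappa. Label values here are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Sentiment labels derived from star ratings (e.g. 1-2 stars -> "neg",
# 3 -> "neu", 4-5 -> "pos") for a handful of reviews.
star_labels = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg"]

# Labels predicted for the same reviews by an independent classifier
# (XLM-RoBERTa in the paper).
clf_labels = ["pos", "neg", "pos", "pos", "neu", "pos", "neu", "neg"]

# Kappa corrects raw agreement for chance agreement implied by the marginals.
kappa = cohen_kappa_score(star_labels, clf_labels)
print(round(kappa, 3))  # prints 0.619 for this toy sample
```

On this toy sample kappa is higher than the paper's reported 0.459; values near 0.4-0.6 are conventionally read as "moderate" agreement, which is why the premise hinges on the labels being usable despite imperfect alignment.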
What would settle it
A replication study that replaces the hybrid labels with fully independent human annotations and finds that fine-tuned transformers then match or exceed the accuracy of Random Forest and Linear SVM would falsify the performance superiority claim.
Figures
read the original abstract
For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined each reviewer's star rating with a separate, independent XLM-RoBERTa classifier, yielding moderate inter-method agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both exceeded the fine-tuned XLM-RoBERTa (0.793). McNemar's test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p < 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied to aspect-level sentiment analysis of the reviews for the four apps; reviewers expressed dissatisfaction primarily with transaction speed and poor interface design, and the eJanata app received the worst ratings across all apps. Three policy recommendations are made based on these findings - remediation of app quality, trust-centred release management, and Bangla-first NLP adoption - to help state-owned banks improve their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 5,652 English and Bangla Google Play reviews for four Bangladeshi government mobile banking apps. It employs a hybrid labeling scheme that merges star ratings with predictions from an independent XLM-RoBERTa classifier (κ=0.459), then compares classical models (Random Forest at 0.815 accuracy, Linear SVM at 0.804 weighted F1) against fine-tuned XLM-RoBERTa (0.793 accuracy) and an off-the-shelf baseline. McNemar’s tests show classical models significantly outperform the off-the-shelf transformer but not the fine-tuned variant. Aspect-level sentiment via DeBERTa-v3 identifies transaction speed and interface design as primary pain points, with eJanata rated worst; three policy recommendations follow.
Significance. If the performance ordering is robust, the work supplies concrete evidence that classical models can remain competitive for low-resource bilingual sentiment tasks in a high-stakes public-service domain. The aspect-level findings and policy suggestions offer direct utility for app developers and regulators in developing economies. The reported 16.1-point English–Bangla gap also underscores the need for targeted low-resource model development.
major comments (1)
- [Methods (hybrid labeling)] The ground-truth labels are constructed by combining star ratings with an independent XLM-RoBERTa classifier (κ=0.459). Because the evaluation set therefore contains predictions from a model architecturally close to the fine-tuned XLM-RoBERTa whose accuracy is reported as 0.793, any systematic alignment between the labeler and the fine-tuned model can inflate the transformer’s apparent performance relative to Random Forest and Linear SVM. No ablation that recomputes all metrics on star-rating labels alone is presented, leaving the central claim that classical models outperform transformers vulnerable to label-construction bias.
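The requested ablation and significance check are mechanical to run. A minimal sketch, with toy arrays standing in for the paper's predictions and an exact binomial form of McNemar's test on the discordant pairs:

```python
# Hedged sketch of the ablation the report asks for: score both models against
# star-rating labels alone, then run McNemar's test on the paired predictions.
# All arrays are hypothetical stand-ins, not the paper's data.
import numpy as np
from scipy.stats import binomtest

star_only_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # human-derived ground truth
preds_rf = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])          # Random Forest predictions
preds_xlm = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])         # fine-tuned transformer predictions

# Discordant pairs: items where exactly one model is correct.
rf_right = preds_rf == star_only_labels
xlm_right = preds_xlm == star_only_labels
b = int(np.sum(rf_right & ~xlm_right))   # RF correct, transformer wrong
c = int(np.sum(~rf_right & xlm_right))   # transformer correct, RF wrong

# Exact McNemar test: under H0 the discordant outcomes follow Binomial(b+c, 0.5).
p_value = binomtest(b, b + c, 0.5).pvalue
print(b, c, round(p_value, 3))
```

Reporting b, c, and the exact p-value for each model pair (as this sketch does for one pair) would also address the minor comment below about missing p-values.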
minor comments (2)
- [Abstract] The 16.1-percentage-point accuracy gap between Bangla and English is stated without indicating which model(s) or evaluation split produced the figure.
- [Results] McNemar’s test p-values are given only for the off-the-shelf baseline; the exact p-values for the fine-tuned XLM-RoBERTa comparisons should be reported so readers can assess the “not statistically significant” claim.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The concern about potential bias in our hybrid labeling scheme is well-taken, and we address it directly below. We are committed to strengthening the manuscript through the requested revisions.
read point-by-point responses
Referee: [Methods (hybrid labeling)] The ground-truth labels are constructed by combining star ratings with an independent XLM-RoBERTa classifier (κ=0.459). Because the evaluation set therefore contains predictions from a model architecturally close to the fine-tuned XLM-RoBERTa whose accuracy is reported as 0.793, any systematic alignment between the labeler and the fine-tuned model can inflate the transformer’s apparent performance relative to Random Forest and Linear SVM. No ablation that recomputes all metrics on star-rating labels alone is presented, leaving the central claim that classical models outperform transformers vulnerable to label-construction bias.
Authors: We appreciate the referee's identification of this methodological vulnerability. The hybrid labels were constructed to leverage both explicit star ratings and an independent XLM-RoBERTa model (a distinct training run and data split from the fine-tuned evaluation model), with the moderate κ=0.459 reflecting only partial alignment. Nevertheless, we acknowledge that architectural similarity could introduce systematic bias favoring the transformer. In the revised manuscript we will add a full ablation recomputing accuracy, weighted F1, and McNemar tests using star-rating labels alone. This will allow readers to assess the robustness of the classical-model advantage under a purely human-derived ground truth. Revision: yes.
Circularity Check
No circularity: purely empirical comparison on held-out data with independent labeling
full rationale
The paper conducts a standard supervised learning evaluation: reviews are labeled via a hybrid process (star ratings plus a separate independent XLM-RoBERTa classifier), the labeled corpus is split, models (including fine-tuned XLM-RoBERTa) are trained on the training portion, and accuracy/F1/McNemar results are reported on the held-out test set. No equations, derivations, or self-referential definitions appear; performance numbers are direct empirical measurements rather than quantities forced by construction from fitted parameters or prior self-citations. The labeling step is external to the model training loop and does not reduce any reported metric to the inputs by definition. This is a normal non-circular empirical ML comparison.
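The pipeline the rationale describes (label, split, train, score on held-out data) can be sketched in a few lines. This is a hedged illustration with toy reviews, not the paper's corpus or hyperparameters; TF-IDF features are an assumption about the classical models' input representation:

```python
# Minimal sketch of the supervised comparison loop: TF-IDF features feed a
# classical model, and accuracy / weighted F1 are computed on a held-out split.
# The toy reviews and labels below are hypothetical stand-ins.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reviews = ["app crashes on login", "fast and easy transfer",
           "interface is confusing", "great service",
           "payment failed again", "smooth experience"] * 10
labels = ["neg", "pos", "neg", "pos", "neg", "pos"] * 10

# Held-out test set; stratify keeps class balance, mirroring standard practice.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels)

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
model.fit(X_train, y_train)
preds = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, preds))
print("weighted F1:", f1_score(y_test, preds, average="weighted"))
```

Because every metric here is measured on data the model never saw during fitting, nothing in the evaluation is true by construction, which is the substance of the no-circularity verdict.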
Reference graph
Works this paper leans on
- [1] Bangladesh Bank, “Annual Report 2022–2023,” Bangladesh Bank, Dhaka, Bangladesh, 2023. [Online]. Available: https://www.bb.org.bd/pub/annual/anreport/ar2022-2023.pdf
- [2] Ministry of Finance, “Bangladesh Economic Review 2022,” Finance Division, Government of the People’s Republic of Bangladesh, Dhaka, 2022.
- [3]
- [4] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Trans. Softw. Eng., vol. 43, no. 9, pp. 817–847, Sep. 2017.
- [5] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2020, pp. 8440–8451.
- [6] K. I. Islam, S. Kar, Md S. Islam, and M. R. Amin, “SentNoB: A dataset for analysing sentiment on noisy Bangla texts,” in Findings Assoc. Comput. Linguistics: EMNLP 2021, pp. 3265–3271.
- [7] O. Sen, M. Fuad, Md N. Islam, J. Rabbi, M. Masud, Md K. Hasan, Md A. Awal, A. A. Fime, Md T. H. Fuad, D. Sikder, and Md A. R. Iftee, “Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning based methods,” IEEE Access, vol. 10, pp. 38999–39044, 2022.
- [8] F. Barbieri, L. Espinosa Anke, and J. Camacho-Collados, “XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond,” in Proc. LREC, Marseille, France, 2022, pp. 258–266.
- [9] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Comput., vol. 10, no. 7, pp. 1895–1923, 1998.
- [10] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with disentangled attention,” in Proc. 9th Int. Conf. Learning Representations (ICLR), 2021.
- [11] E. Guzman and W. Maalej, “How do users like this feature? A fine grained sentiment analysis of app reviews,” in Proc. IEEE 22nd Int. Requirements Eng. Conf. (RE), Karlskrona, Sweden, 2014, pp. 153–162.
- [12] S. Sadiq, A. Umer, S. Ullah, S. Mirjalili, V. Rupapara, and M. Nappi, “Discrepancy detection between actual user reviews and numeric ratings of Google App Store using deep learning,” Expert Syst. Appl., vol. 181, Nov. 2021, Art. no. 115111.
- [13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, 2019, pp. 4171–4186.
- [14] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv:1907.11692, Jul. 2019.
- [15] A. Verma and R. Vashisth, “Sentiment analysis of Google Play and App Store reviews of Threads,” in Proc. IEEE ICRITO, 2024.
- [16] M. Kabir, O. B. Mahfuz, S. R. Raiyan, H. Mahmud, and M. K. Hasan, “BanglaBook: A large-scale Bangla dataset for sentiment analysis from book reviews,” in Findings ACL, Toronto, Canada, 2023, pp. 3521–3536.
- [17] A. Bhattacharjee et al., “BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in Bangla,” in Findings NAACL, Seattle, WA, USA, 2022, pp. 1318–1327.
- [18] Md A. Hasan, F. Alam, A. Anjum, S. Das, and A. Anjum, “BLP-2023 Task 2: Sentiment analysis,” in Proc. 1st Workshop Bangla Language Processing (BLP), EMNLP, Singapore, 2023, pp. 354–364.
- [19] F. Ullah, S. Faizullah, I. U. Khan, T. Alghamdi, T. A. Syed, A. B. Alkhodre, M. S. Ayub, and A. Karim, “Prompt-based fine-tuning with multilingual transformers for language-independent sentiment analysis,” Scientific Reports, vol. 15, Art. no. 20834, 2025.
- [20] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar, “SemEval-2014 Task 4: Aspect based sentiment analysis,” in Proc. 8th Int. Workshop Semantic Eval. (SemEval-2014), COLING, Dublin, Ireland, 2014, pp. 27–35.
- [21] T. Zhao, J. Du, Z. Xue, A. Li, and Z. Guan, “Aspect-based sentiment analysis using local context focus mechanism with DeBERTa,” in Proc. 5th Int. Conf. Data-driven Optim. Complex Syst. (DOCS), IEEE, 2023.
- [22] W. Zhang, R. He, H. Peng, L. Bing, and W. Lam, “Cross-lingual aspect-based sentiment analysis with aspect term code-switching,” in Proc. EMNLP, 2021, pp. 9220–9230.
- [23] D. Jayakody et al., “Instruct-DeBERTa: A hybrid approach for aspect-based sentiment analysis on textual reviews,” arXiv:2408.13202, Aug. 2024.
- [24] T. Mahmood, S. Naseem, R. Ashraf, M. Asif, M. Umair, and M. Shah, “Recognizing factors effecting the use of mobile banking apps through sentiment and thematic analysis on user reviews,” Neural Comput. Appl., vol. 35, pp. 23853–23870, 2023.
- [25] B. Ogunleye, T. Maswera, L. Hirsch, J. Gaudoin, and T. Brunsdon, “Comparison of topic modelling approaches in the banking context,” Applied Sciences, vol. 13, no. 2, Art. no. 797, 2023.
- [26] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “Crowdsourcing user reviews to support the evolution of mobile apps,” J. Syst. Softw., vol. 137, pp. 143–162, 2018.
- [27] S. Hassan, C. Tantithamthavorn, C.-P. Bezemer, and A. E. Hassan, “Studying the dialogue between users and developers of free apps in the Google Play Store,” Empirical Softw. Eng., vol. 23, no. 3, pp. 1275–1312, 2018.
- [28] A. A. Ryan et al., “FinTech: Deep learning-based sentiment classification of user reviews from various Bangladeshi mobile financial services,” in Computational Intelligence in Data Science (IFIP AICT), Springer, 2023, pp. 126–140.
- [29] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.