pith. machine review for the scientific record.

arxiv: 2604.13057 · v1 · submitted 2026-03-17 · 💻 cs.CL

Recognition: no theorem link

A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews

Authors on Pith no claims yet

Pith reviewed 2026-05-15 09:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentiment classification · Bangla NLP · mobile banking apps · traditional machine learning · transformer models · aspect-based sentiment analysis · government apps · low-resource languages

The pith

Classical models like Random Forest outperform fine-tuned transformers for sentiment analysis of English-Bangla government banking app reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes 5,652 English and Bangla reviews of four Bangladeshi government mobile banking apps. It shows that traditional machine learning models achieve higher accuracy and F1 scores than transformer-based models on this task. Aspect-level analysis identifies transaction speed and interface design as the main user complaints, with one app rated worst overall. These results highlight practical needs for better monitoring and fixes in digital financial services for developing economies.

Core claim

The central claim is that traditional models such as Random Forest (0.815 accuracy) and Linear SVM (0.804 weighted F1) outperform a fine-tuned XLM-RoBERTa (0.793) on bilingual sentiment classification of app reviews. McNemar's test confirms the classical models are significantly better than the off-the-shelf transformer, though their edge over the fine-tuned variant is not statistically significant. Aspect analysis with DeBERTa-v3 shows primary dissatisfaction with transaction speed and interface design, and the eJanata app receives the lowest ratings. A 16.1-percentage-point performance gap between Bangla and English text is noted.
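The significance claim rests on McNemar's test over paired predictions from two classifiers. A minimal pure-Python sketch of the statistic (the discordant counts below are invented for illustration, not the paper's actual contingency tables):

```python
from math import erf, sqrt

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-square with continuity correction.

    b: reviews classifier A gets right and classifier B gets wrong;
    c: the reverse. Concordant pairs do not enter the statistic.
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df via the normal tail.
    z = sqrt(chi2)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return chi2, p

# Hypothetical discordant counts for Random Forest vs the
# off-the-shelf transformer (illustrative only):
chi2, p = mcnemar(b=130, c=78)
```

A significant result here says only that the two models err on different reviews at a non-chance rate, which is why the non-significant comparison against the fine-tuned variant matters for the headline claim.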

What carries the argument

The hybrid labeling method that combines reviewer star ratings with an independent XLM-RoBERTa classifier to generate ground-truth labels for model training and evaluation.
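The mechanics can be sketched in a few lines: map stars to polarity, reconcile with the classifier's prediction, and measure inter-method agreement with Cohen's kappa. The star thresholds and the fall-back-to-stars rule on disagreement are assumptions for illustration; the paper's exact reconciliation rule is not shown here.

```python
def star_to_label(stars: int) -> str:
    # Assumed convention: 1-2 stars negative, 3 neutral, 4-5 positive.
    return "neg" if stars <= 2 else ("neu" if stars == 3 else "pos")

def hybrid_label(stars: int, clf_label: str) -> str:
    star_label = star_to_label(stars)
    # Assumed rule: keep the agreed label, fall back to stars otherwise.
    return clf_label if clf_label == star_label else star_label

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum(
        (sum(x == l for x in a) / n) * (sum(y == l for y in b) / n)
        for l in labels
    )
    return (p_obs - p_exp) / (1 - p_exp)
```

The reported κ = 0.459 sits in the "moderate" band of the usual Landis-Koch scale, which is exactly why the referee report below treats label reliability as the load-bearing question.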

If this is right

  • Random Forest achieves the highest accuracy of 0.815 on the combined English-Bangla dataset.
  • Linear SVM achieves the highest weighted F1 score of 0.804.
  • Aspect-level sentiment shows the strongest user dissatisfaction with transaction speed and app interface design.
  • The eJanata app receives the worst ratings across all four government apps studied.
  • A 16.1 percentage point accuracy gap between Bangla and English text indicates specific challenges for low-resource language processing.
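The classical side of the comparison is easy to reproduce in outline. A minimal TF-IDF plus Linear SVM sketch with scikit-learn; the toy reviews and labels are invented stand-ins for the 5,652-review bilingual corpus, and the paper's exact preprocessing and hyperparameters are not reproduced here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy reviews standing in for the English-Bangla corpus.
texts = [
    "great app fast transfer",
    "slow transaction very bad",
    "love the clean interface",
    "app crashes and money gets stuck",
    "quick and easy bill payment",
    "terrible login always slow",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# TF-IDF unigrams and bigrams feeding a linear SVM, the kind of
# classical baseline the paper reports as strongest.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
```

Swapping `LinearSVC` for `RandomForestClassifier` gives the other headline baseline; the point of the paper is that either can beat a fine-tuned transformer on this corpus.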

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Classical models may offer a more practical and lower-cost option than transformer fine-tuning for similar sentiment tasks in other low-resource languages.
  • Prioritizing fixes for transaction speed and interface issues could improve user trust in government digital banking services.
  • The performance gap suggests targeted development of Bangla-specific resources could narrow differences in future models.
  • Similar multi-model comparisons could be applied to reviews of other public service apps in comparable economies.

Load-bearing premise

The hybrid labeling method that combines star ratings with an independent XLM-RoBERTa classifier produces sufficiently reliable ground-truth labels despite only moderate inter-method agreement.

What would settle it

A replication study that replaces the hybrid labels with fully independent human annotations and finds that fine-tuned transformers then match or exceed the accuracy of Random Forest and Linear SVM would falsify the performance superiority claim.

Figures

Figures reproduced from arXiv: 2604.13057 by Md. Binyamin, Md Jahid Hasan Imran, Md Muhtasim Munif Fahim, Md. Naim Molla, Md Rezaul Karim, Nura Rayhan, Tonmoy Shil.

Figure 1: Distribution of raw and cleaned reviews across the four applications.
Figure 2: Aspect-based sentiment polarity across six service dimensions.
Figure 3: Longitudinal sentiment trends (2021–2025) across the four applications.
Original abstract

For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined use of the reviewer's star rating for each review along with a separate independent XLM-RoBERTa classifier to produce moderate inter-method agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both were higher than the performance of fine-tuned XLM-RoBERTa (0.793). McNemar's test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p < 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied to analyze the sentiment at the aspect level across the reviews for the four apps; the reviewers expressed their dissatisfaction primarily with the speed of transactions and with the poor design of interfaces; eJanata app received the worst ratings from the reviewers across all apps. Three policy recommendations are made based on these findings - remediation of app quality, trust-centred release management, and Bangla-first NLP adoption - to assist state-owned banks in moving towards improving their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes 5,652 English and Bangla Google Play reviews for four Bangladeshi government mobile banking apps. It employs a hybrid labeling scheme that merges star ratings with predictions from an independent XLM-RoBERTa classifier (κ=0.459), then compares classical models (Random Forest at 0.815 accuracy, Linear SVM at 0.804 weighted F1) against fine-tuned XLM-RoBERTa (0.793 accuracy) and an off-the-shelf baseline. McNemar’s tests show classical models significantly outperform the off-the-shelf transformer but not the fine-tuned variant. Aspect-level sentiment via DeBERTa-v3 identifies transaction speed and interface design as primary pain points, with eJanata rated worst; three policy recommendations follow.

Significance. If the performance ordering is robust, the work supplies concrete evidence that classical models can remain competitive for low-resource bilingual sentiment tasks in a high-stakes public-service domain. The aspect-level findings and policy suggestions offer direct utility for app developers and regulators in developing economies. The reported 16.1-point English–Bangla gap also underscores the need for targeted low-resource model development.

major comments (1)
  1. [Methods (hybrid labeling)] The ground-truth labels are constructed by combining star ratings with an independent XLM-RoBERTa classifier (κ=0.459). Because the evaluation set therefore contains predictions from a model architecturally close to the fine-tuned XLM-RoBERTa whose accuracy is reported as 0.793, any systematic alignment between the labeler and the fine-tuned model can inflate the transformer’s apparent performance relative to Random Forest and Linear SVM. No ablation that recomputes all metrics on star-rating labels alone is presented, leaving the central claim that classical models outperform transformers vulnerable to label-construction bias.
minor comments (2)
  1. [Abstract] The 16.1-percentage-point accuracy gap between Bangla and English is stated without indicating which model(s) or evaluation split produced the figure.
  2. [Results] McNemar’s test p-values are given only for the off-the-shelf baseline; the exact p-values for the fine-tuned XLM-RoBERTa comparisons should be reported so readers can assess the “not statistically significant” claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The concern about potential bias in our hybrid labeling scheme is well-taken, and we address it directly below. We are committed to strengthening the manuscript through the requested revisions.

Point-by-point responses
  1. Referee: [Methods (hybrid labeling)] The ground-truth labels are constructed by combining star ratings with an independent XLM-RoBERTa classifier (κ=0.459). Because the evaluation set therefore contains predictions from a model architecturally close to the fine-tuned XLM-RoBERTa whose accuracy is reported as 0.793, any systematic alignment between the labeler and the fine-tuned model can inflate the transformer’s apparent performance relative to Random Forest and Linear SVM. No ablation that recomputes all metrics on star-rating labels alone is presented, leaving the central claim that classical models outperform transformers vulnerable to label-construction bias.

    Authors: We appreciate the referee's identification of this methodological vulnerability. The hybrid labels were constructed to leverage both explicit star ratings and an independent XLM-RoBERTa model (distinct training run and data split from the fine-tuned evaluation model), with the moderate κ=0.459 reflecting only partial alignment. Nevertheless, we acknowledge that architectural similarity could introduce systematic bias favoring the transformer. In the revised manuscript we will add a full ablation recomputing accuracy, weighted F1, and McNemar tests using star-rating labels alone. This will allow readers to assess the robustness of the classical-model advantage under a purely human-derived ground truth. revision: yes
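The promised ablation is mechanically simple: score the same frozen predictions against star-rating-only labels and check whether the model ordering holds. A pure-Python weighted-F1 sketch (the paper presumably uses a library implementation such as scikit-learn's; this is an illustrative equivalent):

```python
def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Per-class F1 averaged with class-frequency weights."""
    n = len(y_true)
    score = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += f1 * sum(t == label for t in y_true) / n
    return score

def ablation(preds, hybrid_labels, star_labels):
    # Same predictions scored against both label sets: a stable
    # model ordering under star-only labels would defuse the
    # label-construction objection.
    return weighted_f1(hybrid_labels, preds), weighted_f1(star_labels, preds)
```

If the classical-model advantage survives under star-only labels, the referee's major concern is answered; if it flips, the headline claim is an artifact of the labeling scheme.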

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on held-out data with independent labeling

Full rationale

The paper conducts a standard supervised learning evaluation: reviews are labeled via a hybrid process (star ratings plus a separate independent XLM-RoBERTa classifier), the labeled corpus is split, models (including fine-tuned XLM-RoBERTa) are trained on the training portion, and accuracy/F1/McNemar results are reported on the held-out test set. No equations, derivations, or self-referential definitions appear; performance numbers are direct empirical measurements rather than quantities forced by construction from fitted parameters or prior self-citations. The labeling step is external to the model training loop and does not reduce any reported metric to the inputs by definition. This is a normal non-circular empirical ML comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard supervised learning assumptions and the validity of Google Play review text as representative of user experience; no new entities or ad-hoc axioms are introduced.

pith-pipeline@v0.9.0 · 5639 in / 1082 out tokens · 32046 ms · 2026-05-15T09:18:03.943716+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Annual Report 2022–2023,

    Bangladesh Bank, “Annual Report 2022–2023,” Bangladesh Bank, Dhaka, Bangladesh, 2023. [Online]. Available: https://www.bb.org.bd/pub/annual/anreport/ar2022-2023.pdf

  2. [2]

    Bangladesh Economic Review 2022,

    Ministry of Finance, “Bangladesh Economic Review 2022,” Finance Division, Government of the People’s Republic of Bangladesh, Dhaka,

  3. [3]

    Available: https://mof.gov.bd

    [Online]. Available: https://mof.gov.bd

  4. [4]

    A survey of app store analysis for software engineering,

    W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Trans. Softw. Eng., vol. 43, no. 9, pp. 817–847, Sep. 2017

  5. [5]

    Unsupervised cross-lingual representation learning at scale,

    A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2020, pp. 8440–8451

  6. [6]

    SentNoB: A dataset for analysing sentiment on noisy Bangla texts,

    K. I. Islam, S. Kar, Md S. Islam, and M. R. Amin, “SentNoB: A dataset for analysing sentiment on noisy Bangla texts,” in Findings Assoc. Comput. Linguistics: EMNLP 2021, pp. 3265–3271

  7. [7]

    Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning based methods,

    O. Sen, M. Fuad, Md N. Islam, J. Rabbi, M. Masud, Md K. Hasan, Md A. Awal, A. A. Fime, Md T. H. Fuad, D. Sikder, and Md A. R. Iftee, “Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning based methods,” IEEE Access, vol. 10, pp. 38999–39044, 2022

  8. [8]

    XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond,

    F. Barbieri, L. Espinosa Anke, and J. Camacho-Collados, “XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond,” in Proc. LREC, Marseille, France, 2022, pp. 258–266

  9. [9]

    Approximate statistical tests for comparing supervised classification learning algorithms,

    T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Comput., vol. 10, no. 7, pp. 1895–1923, 1998

  10. [10]

    DeBERTa: Decoding-enhanced BERT with disentangled attention,

    P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with disentangled attention,” in Proc. 9th Int. Conf. Learning Representations (ICLR), 2021

  11. [11]

    How do users like this feature? A fine grained sentiment analysis of app reviews,

    E. Guzman and W. Maalej, “How do users like this feature? A fine grained sentiment analysis of app reviews,” in Proc. IEEE 22nd Int. Requirements Eng. Conf. (RE), Karlskrona, Sweden, 2014, pp. 153–162

  12. [12]

    Discrepancy detection between actual user reviews and numeric ratings of Google App Store using deep learning,

    S. Sadiq, A. Umer, S. Ullah, S. Mirjalili, V. Rupapara, and M. Nappi, “Discrepancy detection between actual user reviews and numeric ratings of Google App Store using deep learning,” Expert Syst. Appl., vol. 181, Nov. 2021, Art. no. 115111

  13. [13]

    BERT: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, MN, USA, 2019, pp. 4171–4186

  14. [14]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv:1907.11692, Jul. 2019

  15. [15]

    Sentiment analysis of Google Play and App Store reviews of Threads,

    A. Verma and R. Vashisth, “Sentiment analysis of Google Play and App Store reviews of Threads,” in Proc. IEEE ICRITO, 2024

  16. [16]

    BanglaBook: A large-scale Bangla dataset for sentiment analysis from book reviews,

    M. Kabir, O. B. Mahfuz, S. R. Raiyan, H. Mahmud, and M. K. Hasan, “BanglaBook: A large-scale Bangla dataset for sentiment analysis from book reviews,” in Findings ACL, Toronto, Canada, 2023, pp. 3521–3536

  17. [17]

    BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in Bangla,

    A. Bhattacharjee et al., “BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in Bangla,” in Findings NAACL, Seattle, WA, USA, 2022, pp. 1318–1327

  18. [18]

    BLP-2023 Task 2: Sentiment analysis,

    Md A. Hasan, F. Alam, A. Anjum, S. Das, and A. Anjum, “BLP-2023 Task 2: Sentiment analysis,” in Proc. 1st Workshop Bangla Language Processing (BLP), EMNLP, Singapore, 2023, pp. 354–364

  19. [19]

    Prompt-based fine-tuning with multilingual transformers for language-independent sentiment analysis,

    F. Ullah, S. Faizullah, I. U. Khan, T. Alghamdi, T. A. Syed, A. B. Alkhodre, M. S. Ayub, and A. Karim, “Prompt-based fine-tuning with multilingual transformers for language-independent sentiment analysis,” Scientific Reports, vol. 15, Art. no. 20834, 2025

  20. [20]

    SemEval-2014 Task 4: Aspect based sentiment analysis,

    M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar, “SemEval-2014 Task 4: Aspect based sentiment analysis,” in Proc. 8th Int. Workshop Semantic Eval. (SemEval-2014), COLING, Dublin, Ireland, 2014, pp. 27–35

  21. [21]

    Aspect-based sentiment analysis using local context focus mechanism with DeBERTa,

    T. Zhao, J. Du, Z. Xue, A. Li, and Z. Guan, “Aspect-based sentiment analysis using local context focus mechanism with DeBERTa,” in Proc. 5th Int. Conf. Data-driven Optim. Complex Syst. (DOCS), IEEE, 2023

  22. [22]

    Cross-lingual aspect-based sentiment analysis with aspect term code-switching,

    W. Zhang, R. He, H. Peng, L. Bing, and W. Lam, “Cross-lingual aspect-based sentiment analysis with aspect term code-switching,” in Proc. EMNLP, 2021, pp. 9220–9230

  23. [23]

    Instruct-DeBERTa: A hybrid approach for aspect-based sentiment analysis on textual reviews,

    D. Jayakody et al., “Instruct-DeBERTa: A hybrid approach for aspect-based sentiment analysis on textual reviews,” arXiv:2408.13202, Aug. 2024

  24. [24]

    Recognizing factors effecting the use of mobile banking apps through sentiment and thematic analysis on user reviews,

    T. Mahmood, S. Naseem, R. Ashraf, M. Asif, M. Umair, and M. Shah, “Recognizing factors effecting the use of mobile banking apps through sentiment and thematic analysis on user reviews,” Neural Comput. Appl., vol. 35, pp. 23853–23870, 2023

  25. [25]

    Comparison of topic modelling approaches in the banking context,

    B. Ogunleye, T. Maswera, L. Hirsch, J. Gaudoin, and T. Brunsdon, “Comparison of topic modelling approaches in the banking context,” Applied Sciences, vol. 13, no. 2, Art. no. 797, 2023

  26. [26]

    Crowdsourcing user reviews to support the evolution of mobile apps,

    F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “Crowdsourcing user reviews to support the evolution of mobile apps,” J. Syst. Softw., vol. 137, pp. 143–162, 2018

  27. [27]

    Studying the dialogue between users and developers of free apps in the Google Play Store,

    S. Hassan, C. Tantithamthavorn, C.-P. Bezemer, and A. E. Hassan, “Studying the dialogue between users and developers of free apps in the Google Play Store,” Empirical Softw. Eng., vol. 23, no. 3, pp. 1275–1312, 2018

  28. [28]

    FinTech: Deep learning-based sentiment classification of user reviews from various Bangladeshi mobile financial services,

    A. A. Ryan et al., “FinTech: Deep learning-based sentiment classification of user reviews from various Bangladeshi mobile financial services,” in Computational Intelligence in Data Science (IFIP AICT), Springer, 2023, pp. 126–140

  29. [29]

    Scikit-learn: Machine learning in Python,

    F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011