Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3
The pith
A hybrid model fuses EfficientNetB3, Vision Transformer and Conformer features via multi-head cross-attention to recognize 78 classes of handwritten Bangla characters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an interaction-aware hybrid network combining EfficientNetB3, Vision Transformer and Conformer branches, fused by multi-head cross-attention, achieves 98.84 percent accuracy on the newly collected 78-class Bangla handwritten character dataset and 96.49 percent on the independent CHBCR benchmark while providing interpretable visualizations through Grad-CAM.
What carries the argument
The multi-head cross-attention fusion mechanism that integrates feature maps from the parallel EfficientNetB3, Vision Transformer and Conformer modules.
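The mechanism can be sketched as token sequences from one branch attending to keys and values from another. The dimensions, head count, and identity (unlearned) projections below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query, context, num_heads=4):
    """Queries from one branch attend to keys/values from another.
    query:   (tokens_q, dim) features, e.g. from the EfficientNetB3 branch
    context: (tokens_kv, dim) features, e.g. from the ViT branch
    Learned Q/K/V projection matrices are omitted to keep the sketch
    minimal; a trained model would include them."""
    tq, d = query.shape
    tk = context.shape[0]
    dh = d // num_heads
    q = query.reshape(tq, num_heads, dh).transpose(1, 0, 2)    # (h, tq, dh)
    k = context.reshape(tk, num_heads, dh).transpose(1, 0, 2)  # (h, tk, dh)
    v = k
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)            # (h, tq, tk)
    out = softmax(scores) @ v                                  # (h, tq, dh)
    return out.transpose(1, 0, 2).reshape(tq, d)               # (tq, dim)

rng = np.random.default_rng(0)
cnn_feat = rng.standard_normal((49, 64))  # e.g. a 7x7 grid flattened to 49 tokens
vit_feat = rng.standard_normal((16, 64))  # e.g. 16 patch tokens
fused = multi_head_cross_attention(cnn_feat, vit_feat)
print(fused.shape)  # (49, 64)
```

The fused output keeps the query branch's token count, so each branch can be enriched with the others' features before classification.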
If this is right
- Better handling of composite characters and numerals that share similar visual shapes.
- Demonstrated generalization on an external benchmark dataset beyond the training collection.
- Public availability of the dataset and code for other researchers to build upon.
- Grad-CAM visualizations that reveal which stroke regions the model relies on for each class.
Where Pith is reading between the lines
- The same fusion pattern could be tested on other scripts that exhibit high intra-class variation and inter-class similarity.
- Higher character-level accuracy may improve downstream systems such as automatic transcription of printed or handwritten Bengali texts.
- Collecting data from left- and right-handed writers across age groups may become a practical standard for building robust recognition models.
Load-bearing premise
Performance gains come mainly from the cross-attention interaction rather than simply from using three large models or from the balanced dataset alone.
What would settle it
Train EfficientNetB3, Vision Transformer and Conformer separately on the same dataset and compare their accuracies to the full hybrid model to test whether the fusion step adds measurable value.
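A fusion-free control for that comparison is a plain ensemble that averages the branches' class logits; a minimal sketch (the class count matches the dataset, everything else is illustrative):

```python
import numpy as np

def ensemble_baseline(branch_logits):
    """Average the class logits of independently trained branches.
    This is the interaction-free control: if cross-attention fusion
    adds value, the hybrid model should beat this baseline on the
    same test split."""
    return np.mean(branch_logits, axis=0)

rng = np.random.default_rng(1)
n_classes = 78  # classes in the proposed Bangla dataset
# 3 branches (EfficientNetB3, ViT, Conformer), 5 test samples -- dummy logits
logits = rng.standard_normal((3, 5, n_classes))
preds = ensemble_baseline(logits).argmax(axis=1)
print(preds.shape)  # (5,)
```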
Original abstract
Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order activities that can be performed accurately through character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual resemblance between characters. The available datasets are usually limited in intra-class variation and inequitable in class distribution. We have constructed a new balanced dataset of handwritten Bangla characters to overcome these problems. It consists of 78 classes, each with approximately 650 samples, and contains the basic characters, composite (Juktobarno) characters, and numerals. The samples were collected from contributors spanning a wide age range and varied socioeconomic backgrounds, including elementary and high school students, university students, and professionals, with both right- and left-handed writers represented. We further propose an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research are publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a new balanced dataset of Bangla handwritten characters (78 classes, ~650 samples per class collected from diverse demographics including age, socioeconomic status, and handedness) and proposes a hybrid architecture integrating EfficientNetB3, Vision Transformer, and Conformer modules in parallel with a multi-head cross-attention fusion mechanism. It reports 98.84% accuracy on the internal test set of the new dataset and 96.49% on the external CHBCR benchmark, supported by Grad-CAM visualizations for interpretability, with the dataset and code released publicly.
Significance. If the results hold, the work contributes a publicly available, balanced dataset addressing common limitations in existing Bangla character resources and shows competitive performance with apparent generalization across benchmarks. The public release of data and source code supports reproducibility and further research on hybrid attention-based models for low-resource scripts.
major comments (2)
- Experimental evaluation section: The central performance claims (98.84% and 96.49%) and attribution to the 'interaction-aware' design are not supported by ablation studies comparing the full multi-head cross-attention fusion model against the three backbones used individually or in simple ensemble/combination without the fusion module. Without these controls, it is not possible to determine whether the reported gains arise primarily from the proposed interaction mechanism.
- Dataset and experimental setup sections: The manuscript does not specify the exact train/validation/test split ratios, any class-wise distribution verification beyond the stated balance, data augmentation strategies, or training hyperparameters (optimizer, learning rate, batch size, epochs) used to obtain the reported accuracies. These details are required to assess the reliability and reproducibility of the generalization claim on the external benchmark.
minor comments (2)
- Abstract and introduction: The total dataset size and any preprocessing steps (e.g., resizing, normalization) applied to inputs for the three parallel branches are not stated explicitly.
- Grad-CAM section: The visualizations would benefit from quantitative metrics (e.g., localization accuracy or comparison to attention maps from individual backbones) rather than qualitative description alone.
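One quantitative metric of the kind the minor comment asks for is localization energy: the fraction of a Grad-CAM map's mass that falls inside a stroke mask. The sketch below uses randomly generated arrays and a hypothetical mask purely for illustration:

```python
import numpy as np

def gradcam_map(activations, gradients):
    """Grad-CAM: weight each channel's activation map by its spatially
    pooled gradient, sum over channels, and clip at zero (ReLU)."""
    weights = gradients.mean(axis=(1, 2))             # (channels,)
    cam = np.tensordot(weights, activations, axes=1)  # (H, W)
    return np.maximum(cam, 0)

def localization_energy(cam, mask):
    """Fraction of the CAM's total mass inside the stroke mask.
    Higher values mean the model attends to the character itself
    rather than the background."""
    total = cam.sum()
    return float((cam * mask).sum() / total) if total > 0 else 0.0

rng = np.random.default_rng(2)
acts = rng.random((8, 7, 7))   # 8 channels on a 7x7 feature map (dummy)
grads = rng.random((8, 7, 7))  # gradients of the class score (dummy)
mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1.0           # hypothetical binary stroke region
score = localization_energy(gradcam_map(acts, grads), mask)
print(0.0 <= score <= 1.0)  # True
```

Averaging such a score per class would let the maps from the fused model be compared against those from the individual backbones, as the comment suggests.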
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the work.
Point-by-point responses
Referee: Experimental evaluation section: The central performance claims (98.84% and 96.49%) and attribution to the 'interaction-aware' design are not supported by ablation studies comparing the full multi-head cross-attention fusion model against the three backbones used individually or in simple ensemble/combination without the fusion module. Without these controls, it is not possible to determine whether the reported gains arise primarily from the proposed interaction mechanism.
Authors: We agree that ablation studies are required to rigorously attribute performance gains to the multi-head cross-attention fusion rather than to the individual backbones or simpler combinations. In the revised manuscript we will add these controls, reporting accuracy for each backbone run individually, for pairwise combinations, and for a simple ensemble without the cross-attention module. The new results will be presented in the experimental evaluation section alongside the existing figures. revision: yes
Referee: Dataset and experimental setup sections: The manuscript does not specify the exact train/validation/test split ratios, any class-wise distribution verification beyond the stated balance, data augmentation strategies, or training hyperparameters (optimizer, learning rate, batch size, epochs) used to obtain the reported accuracies. These details are required to assess the reliability and reproducibility of the generalization claim on the external benchmark.
Authors: We acknowledge that these implementation details were omitted and are essential for reproducibility. The revised manuscript will explicitly state the train/validation/test split ratios, confirm that class balance was verified in each split, describe the data-augmentation pipeline, and list all training hyperparameters (optimizer, learning-rate schedule, batch size, and epoch count). These additions will appear in the Dataset and Experimental Setup sections. revision: yes
Circularity Check
No circularity: empirical accuracies from standard evaluation
Full rationale
The paper reports direct test-set accuracies (98.84% on the new balanced 78-class dataset, 96.49% on external CHBCR) obtained by training the hybrid EfficientNetB3+ViT+Conformer model with multi-head cross-attention. No equations, first-principles derivations, or parameter predictions are presented that reduce to fitted inputs or self-citations by construction. Dataset construction and architecture choices are described explicitly without self-referential loops. Lack of ablations is a separate evidence-strength issue, not circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Training and test samples are drawn from the same underlying distribution of real-world Bangla handwriting.
- standard math Standard supervised training of the hybrid network converges to a solution that reflects the claimed accuracy.
Reference graph
Works this paper leans on
- [1] J. Memon, M. Sami, R. A. Khan, and M. Uddin, “Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR),” IEEE Access, vol. 8, pp. 142642–142668, 2020.
- [2] J. O. Bappi, M. A. T. Rony, and M. S. Islam, “BNVGLENET: Hypercomplex Bangla handwriting character recognition with hierarchical class expansion using convolutional neural networks,” Natural Language Processing Journal, vol. 7, p. 100068, 2024.
- [3] A. R. Chowdhury et al., “Handwritten Bangla character recognition: A comprehensive review,” IEEE Access, vol. 8, pp. 110250–110275, 2020.
- [4] M. M. Rahman et al., “Bangla handwritten character recognition using deep learning,” Applied Intelligence, vol. 49, pp. 3341–3354, 2019.
- [5] S. Ahlawat, A. Choudhary, A. Nayyar, S. Singh, and B. Yoon, “Improved handwritten digit recognition using convolutional neural networks (CNN),” Sensors, vol. 20, no. 12, p. 3344, 2020.
- [6] B. R. Kavitha and C. B. Srimathi, “Benchmarking on offline handwritten Tamil character recognition using convolutional neural networks,” Journal of King Saud University – Computer and Information Sciences, vol. 34, no. 4, pp. 1183–1190, 2022.
- [7] F. Mushtaq, M. M. Misgar, M. Kumar, and S. S. Khurana, “UrduDeepNet: Offline handwritten Urdu character recognition using deep neural network,” Neural Computing and Applications, vol. 33, no. 22, pp. 15229–15252, 2021.
- [8] C. Chandankhede and R. Sachdeo, “Offline MODI script character recognition using deep learning techniques,” Multimedia Tools and Applications, 2023.
- [9] T. Ghosh et al., “Performance analysis of state-of-the-art convolutional neural network architectures in Bangla handwritten character recognition,” Pattern Recognition and Image Analysis, vol. 31, no. 1, pp. 60–71, 2021.
- [10] M. El Khayati et al., “Leveraging transfer learning and mobile-enabled convolutional neural networks for improved Arabic handwritten character recognition,” IEEE Access, 2025.
- [11] Y. Li, D. Chen, T. Tang, and X. Shen, “HTR-VT: Handwritten text recognition with vision transformer,” Pattern Recognition, vol. 158, p. 110967, 2025.
- [12] I. M. Towhid et al., “CHBCR-DB dataset,” Mendeley Data, 2020.
- [13] J. Gan et al., “Characters as graphs: Interpretable handwritten Chinese character recognition via pyramid graph transformer,” Pattern Recognition, vol. 137, p. 109317, 2023.
- [14] M. R. Al-Maamari et al., “Integrating CNN and transformer architectures for superior Arabic printed and handwriting characters classification,” Scientific Reports, vol. 15, no. 1, p. 29936, 2025.
- [15] V. Agrawal et al., “Performance analysis of hybrid deep learning framework using a vision transformer and convolutional neural network for handwritten digit recognition,” MethodsX, vol. 12, p. 102554, 2024.
- [16] A. Fateh et al., “Advancing multilingual handwritten numeral recognition with attention-driven transfer learning,” IEEE Access, vol. 12, pp. 41381–41395, 2024.
- [17] S. A. Molavi and B. BabaAli, “A self-attention-based deep architecture for online handwriting recognition,” Neural Computing and Applications, vol. 36, no. 27, pp. 17165–17179, 2024.
- [18] D. Coquenet, C. Chatelain, and T. Paquet, “End-to-end handwritten paragraph text recognition using a vertical attention network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 508–524, 2022.
- [19] A. K. M. S. A. Rabby et al., “EkushNet: Using convolutional neural network for Bangla handwritten recognition,” Procedia Computer Science, vol. 143, pp. 603–610, 2018.
- [20] M. S. Islam et al., “RATNet: A deep learning model for Bengali handwritten characters recognition,” Multimedia Tools and Applications, vol. 81, no. 8, pp. 10631–10651, 2022.
- [21] R. Chakraborty et al., “MSBNet: Handwritten Bangla character recognition using lightweight multi-scale CNN architecture,” in Proc. Int. Conf. Data Science and Management of Data, 2024, pp. 107–115.
- [22] M. N. I. Opu et al., “Handwritten Bangla character recognition using convolutional neural networks: A comparative study and new lightweight model,” Neural Computing and Applications, vol. 36, no. 1, pp. 337–348, 2024.
- [23] M. T. Sabira et al., “Bengali handwritten character recognition enhanced with morphological preprocessing and attention mechanism-based convolutional neural network,” in Proc. Int. Conf. Computing Advancements, 2024, pp. 513–519.
- [24] M. Raquib et al., “VashaNet: An automated system for recognizing handwritten Bangla basic characters using deep convolutional neural network,” Machine Learning with Applications, vol. 17, p. 100568, 2024.
- [25] M. Raquib et al., “VashaNet-V2: Bangla handwritten character recognition using a novel deep convolutional neural network and an extended original dataset,” in Proc. Int. Conf. Machine Intelligence and Emerging Technologies, 2024, pp. 491–506.
- [26] S. Saha and M. Bhattacharya, “InkSynth: Recognizing Bengali compound characters with synthesized data and deep fusion networks,” International Journal on Document Analysis and Recognition, 2025.
- [27] T. S. B. Ahmed et al., “A vision transformer-based hybrid neural architecture for automated handwritten Bangla character recognition and braille conversion,” Knowledge-Based Systems, 2025.
- [28] M. A. Rahman et al., “Quantum-enhanced handwritten Bangla character recognition: A hybrid quantum classical neural network approach,” Research Square Preprint, 2026.
- [29] G. Deng and L. W. Cahill, “An adaptive Gaussian filter for noise reduction and edge detection,” in Proc. IEEE Nuclear Science Symposium, 1993, pp. 1615–1619.
- [30] L. Xu and E. Oja, “Randomized Hough transform (RHT): Basic mechanisms, algorithms, and computational complexities,” CVGIP: Image Understanding, vol. 57, no. 2, pp. 131–154, 1993.
- [31] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, 2019, pp. 6105–6114.
- [32] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
- [33] S. Woo et al., “CBAM: Convolutional block attention module,” in Proc. ECCV, 2018, pp. 3–19.
- [34] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
- [35] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
- [36] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv:1606.08415, 2016.
- [37] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- [38] A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
- [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015, pp. 448–456.
- [40] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs,” in Neurocomputing, Springer, 1990.
- [41] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
discussion (0)