Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3
The pith
A hybrid model fuses EfficientNetB3, Vision Transformer and Conformer features via multi-head cross-attention to recognize 78 classes of handwritten Bangla characters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an interaction-aware hybrid network combining EfficientNetB3, Vision Transformer and Conformer branches, fused by multi-head cross-attention, achieves 98.84 percent accuracy on the newly collected 78-class Bangla handwritten character dataset and 96.49 percent on the independent CHBCR benchmark while providing interpretable visualizations through Grad-CAM.
What carries the argument
The multi-head cross-attention fusion mechanism that integrates feature maps from the parallel EfficientNetB3, Vision Transformer and Conformer modules.
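The mechanism can be sketched as token sequences from one branch attending to keys and values from another. The dimensions, head count, and identity (unlearned) projections below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query, context, num_heads=4):
    """Queries from one branch attend to keys/values from another.
    query:   (tokens_q, dim) features, e.g. from the EfficientNetB3 branch
    context: (tokens_kv, dim) features, e.g. from the ViT branch
    Learned Q/K/V projection matrices are omitted to keep the sketch
    minimal; a trained model would include them."""
    tq, d = query.shape
    tk = context.shape[0]
    dh = d // num_heads
    q = query.reshape(tq, num_heads, dh).transpose(1, 0, 2)    # (h, tq, dh)
    k = context.reshape(tk, num_heads, dh).transpose(1, 0, 2)  # (h, tk, dh)
    v = k
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)            # (h, tq, tk)
    out = softmax(scores) @ v                                  # (h, tq, dh)
    return out.transpose(1, 0, 2).reshape(tq, d)               # (tq, dim)

rng = np.random.default_rng(0)
cnn_feat = rng.standard_normal((49, 64))  # e.g. a 7x7 grid flattened to 49 tokens
vit_feat = rng.standard_normal((16, 64))  # e.g. 16 patch tokens
fused = multi_head_cross_attention(cnn_feat, vit_feat)
print(fused.shape)  # (49, 64)
```

The fused output keeps the query branch's token count, so each branch can be enriched with the others' features before classification.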
If this is right
- Better handling of composite characters and numerals that share similar visual shapes.
- Demonstrated generalization on an external benchmark dataset beyond the training collection.
- Public availability of the dataset and code for other researchers to build upon.
- Grad-CAM visualizations that reveal which stroke regions the model relies on for each class.
Where Pith is reading between the lines
- The same fusion pattern could be tested on other scripts that exhibit high intra-class variation and inter-class similarity.
- Higher character-level accuracy may improve downstream systems such as automatic transcription of printed or handwritten Bengali texts.
- Collecting data from left- and right-handed writers across age groups may become a practical standard for building robust recognition models.
Load-bearing premise
Performance gains come mainly from the cross-attention interaction rather than simply from using three large models or from the balanced dataset alone.
What would settle it
Train EfficientNetB3, Vision Transformer and Conformer separately on the same dataset and compare their accuracies to the full hybrid model to test whether the fusion step adds measurable value.
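A fusion-free control for that comparison is a plain ensemble that averages the branches' class logits; a minimal sketch (the class count matches the dataset, everything else is illustrative):

```python
import numpy as np

def ensemble_baseline(branch_logits):
    """Average the class logits of independently trained branches.
    This is the interaction-free control: if cross-attention fusion
    adds value, the hybrid model should beat this baseline on the
    same test split."""
    return np.mean(branch_logits, axis=0)

rng = np.random.default_rng(1)
n_classes = 78  # classes in the proposed Bangla dataset
# 3 branches (EfficientNetB3, ViT, Conformer), 5 test samples -- dummy logits
logits = rng.standard_normal((3, 5, n_classes))
preds = ensemble_baseline(logits).argmax(axis=1)
print(preds.shape)  # (5,)
```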
Original abstract
Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order activities that can be performed accurately through character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual resemblance between characters. The available datasets are usually limited in intra-class variation and inequitable in class distribution. We have constructed a new balanced dataset of handwritten Bangla characters to overcome these problems. It consists of 78 classes, each with approximately 650 samples, and contains the basic characters, composite (Juktobarno) characters, and numerals. The samples were collected from contributors spanning a wide age range and varied socioeconomic backgrounds, including elementary and high school students, university students, and professionals, with both right- and left-handed writers represented. We further propose an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research are publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a new balanced dataset of Bangla handwritten characters (78 classes, ~650 samples per class collected from diverse demographics including age, socioeconomic status, and handedness) and proposes a hybrid architecture integrating EfficientNetB3, Vision Transformer, and Conformer modules in parallel with a multi-head cross-attention fusion mechanism. It reports 98.84% accuracy on the internal test set of the new dataset and 96.49% on the external CHBCR benchmark, supported by Grad-CAM visualizations for interpretability, with the dataset and code released publicly.
Significance. If the results hold, the work contributes a publicly available, balanced dataset addressing common limitations in existing Bangla character resources and shows competitive performance with apparent generalization across benchmarks. The public release of data and source code supports reproducibility and further research on hybrid attention-based models for low-resource scripts.
major comments (2)
- Experimental evaluation section: The central performance claims (98.84% and 96.49%) and attribution to the 'interaction-aware' design are not supported by ablation studies comparing the full multi-head cross-attention fusion model against the three backbones used individually or in simple ensemble/combination without the fusion module. Without these controls, it is not possible to determine whether the reported gains arise primarily from the proposed interaction mechanism.
- Dataset and experimental setup sections: The manuscript does not specify the exact train/validation/test split ratios, any class-wise distribution verification beyond the stated balance, data augmentation strategies, or training hyperparameters (optimizer, learning rate, batch size, epochs) used to obtain the reported accuracies. These details are required to assess the reliability and reproducibility of the generalization claim on the external benchmark.
minor comments (2)
- Abstract and introduction: The total dataset size and any preprocessing steps (e.g., resizing, normalization) applied to inputs for the three parallel branches are not stated explicitly.
- Grad-CAM section: The visualizations would benefit from quantitative metrics (e.g., localization accuracy or comparison to attention maps from individual backbones) rather than qualitative description alone.
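One quantitative metric of the kind the minor comment asks for is localization energy: the fraction of a Grad-CAM map's mass that falls inside a stroke mask. The sketch below uses randomly generated arrays and a hypothetical mask purely for illustration:

```python
import numpy as np

def gradcam_map(activations, gradients):
    """Grad-CAM: weight each channel's activation map by its spatially
    pooled gradient, sum over channels, and clip at zero (ReLU)."""
    weights = gradients.mean(axis=(1, 2))             # (channels,)
    cam = np.tensordot(weights, activations, axes=1)  # (H, W)
    return np.maximum(cam, 0)

def localization_energy(cam, mask):
    """Fraction of the CAM's total mass inside the stroke mask.
    Higher values mean the model attends to the character itself
    rather than the background."""
    total = cam.sum()
    return float((cam * mask).sum() / total) if total > 0 else 0.0

rng = np.random.default_rng(2)
acts = rng.random((8, 7, 7))   # 8 channels on a 7x7 feature map (dummy)
grads = rng.random((8, 7, 7))  # gradients of the class score (dummy)
mask = np.zeros((7, 7))
mask[2:5, 2:5] = 1.0           # hypothetical binary stroke region
score = localization_energy(gradcam_map(acts, grads), mask)
print(0.0 <= score <= 1.0)  # True
```

Averaging such a score per class would let the maps from the fused model be compared against those from the individual backbones, as the comment suggests.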
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the work.
Point-by-point responses
Referee: Experimental evaluation section: The central performance claims (98.84% and 96.49%) and attribution to the 'interaction-aware' design are not supported by ablation studies comparing the full multi-head cross-attention fusion model against the three backbones used individually or in simple ensemble/combination without the fusion module. Without these controls, it is not possible to determine whether the reported gains arise primarily from the proposed interaction mechanism.
Authors: We agree that ablation studies are required to rigorously attribute performance gains to the multi-head cross-attention fusion rather than to the individual backbones or simpler combinations. In the revised manuscript we will add these controls, reporting accuracy for each backbone run individually, for pairwise combinations, and for a simple ensemble without the cross-attention module. The new results will be presented in the experimental evaluation section alongside the existing figures. revision: yes
Referee: Dataset and experimental setup sections: The manuscript does not specify the exact train/validation/test split ratios, any class-wise distribution verification beyond the stated balance, data augmentation strategies, or training hyperparameters (optimizer, learning rate, batch size, epochs) used to obtain the reported accuracies. These details are required to assess the reliability and reproducibility of the generalization claim on the external benchmark.
Authors: We acknowledge that these implementation details were omitted and are essential for reproducibility. The revised manuscript will explicitly state the train/validation/test split ratios, confirm that class balance was verified in each split, describe the data-augmentation pipeline, and list all training hyperparameters (optimizer, learning-rate schedule, batch size, and epoch count). These additions will appear in the Dataset and Experimental Setup sections. revision: yes
Circularity Check
No circularity: empirical accuracies from standard evaluation
Full rationale
The paper reports direct test-set accuracies (98.84% on the new balanced 78-class dataset, 96.49% on external CHBCR) obtained by training the hybrid EfficientNetB3+ViT+Conformer model with multi-head cross-attention. No equations, first-principles derivations, or parameter predictions are presented that reduce to fitted inputs or self-citations by construction. Dataset construction and architecture choices are described explicitly without self-referential loops. Lack of ablations is a separate evidence-strength issue, not circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Training and test samples are drawn from the same underlying distribution of real-world Bangla handwriting.
- standard math Standard supervised training of the hybrid network converges to a solution that reflects the claimed accuracy.
Reference graph
Works this paper leans on
- [1] J. Memon, M. Sami, R. A. Khan, and M. Uddin, “Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR),” IEEE Access, vol. 8, pp. 142642–142668, 2020.
- [2] J. O. Bappi, M. A. T. Rony, and M. S. Islam, “BNVGLENET: Hypercomplex Bangla handwriting character recognition with hierarchical class expansion using convolutional neural networks,” Natural Language Processing Journal, vol. 7, p. 100068, 2024.
- [3] A. R. Chowdhury et al., “Handwritten Bangla character recognition: A comprehensive review,” IEEE Access, vol. 8, pp. 110250–110275, 2020.
- [4] M. M. Rahman et al., “Bangla handwritten character recognition using deep learning,” Applied Intelligence, vol. 49, pp. 3341–3354, 2019.
- [5] S. Ahlawat, A. Choudhary, A. Nayyar, S. Singh, and B. Yoon, “Improved handwritten digit recognition using convolutional neural networks (CNN),” Sensors, vol. 20, no. 12, p. 3344, 2020.
- [6] B. R. Kavitha and C. B. Srimathi, “Benchmarking on offline handwritten Tamil character recognition using convolutional neural networks,” Journal of King Saud University – Computer and Information Sciences, vol. 34, no. 4, pp. 1183–1190, 2022.
- [7] F. Mushtaq, M. M. Misgar, M. Kumar, and S. S. Khurana, “UrduDeepNet: Offline handwritten Urdu character recognition using deep neural network,” Neural Computing and Applications, vol. 33, no. 22, pp. 15229–15252, 2021.
- [8] C. Chandankhede and R. Sachdeo, “Offline MODI script character recognition using deep learning techniques,” Multimedia Tools and Applications, 2023.
- [9] T. Ghosh et al., “Performance analysis of state-of-the-art convolutional neural network architectures in Bangla handwritten character recognition,” Pattern Recognition and Image Analysis, vol. 31, no. 1, pp. 60–71, 2021.
- [10] M. El Khayati et al., “Leveraging transfer learning and mobile-enabled convolutional neural networks for improved Arabic handwritten character recognition,” IEEE Access, 2025.
- [11] Y. Li, D. Chen, T. Tang, and X. Shen, “HTR-VT: Handwritten text recognition with vision transformer,” Pattern Recognition, vol. 158, p. 110967, 2025.
- [12] I. M. Towhid et al., “CHBCR-DB dataset,” Mendeley Data, 2020.
- [13] J. Gan et al., “Characters as graphs: Interpretable handwritten Chinese character recognition via pyramid graph transformer,” Pattern Recognition, vol. 137, p. 109317, 2023.
- [14] M. R. Al-Maamari et al., “Integrating CNN and transformer architectures for superior Arabic printed and handwriting characters classification,” Scientific Reports, vol. 15, no. 1, p. 29936, 2025.
- [15] V. Agrawal et al., “Performance analysis of hybrid deep learning framework using a vision transformer and convolutional neural network for handwritten digit recognition,” MethodsX, vol. 12, p. 102554, 2024.
- [16] A. Fateh et al., “Advancing multilingual handwritten numeral recognition with attention-driven transfer learning,” IEEE Access, vol. 12, pp. 41381–41395, 2024.
- [17] S. A. Molavi and B. BabaAli, “A self-attention-based deep architecture for online handwriting recognition,” Neural Computing and Applications, vol. 36, no. 27, pp. 17165–17179, 2024.
- [18] D. Coquenet, C. Chatelain, and T. Paquet, “End-to-end handwritten paragraph text recognition using a vertical attention network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 508–524, 2022.
- [19] A. K. M. S. A. Rabby et al., “EkushNet: Using convolutional neural network for Bangla handwritten recognition,” Procedia Computer Science, vol. 143, pp. 603–610, 2018.
- [20] M. S. Islam et al., “RATNet: A deep learning model for Bengali handwritten characters recognition,” Multimedia Tools and Applications, vol. 81, no. 8, pp. 10631–10651, 2022.
- [21] R. Chakraborty et al., “MSBNet: Handwritten Bangla character recognition using lightweight multi-scale CNN architecture,” in Proc. Int. Conf. Data Science and Management of Data, 2024, pp. 107–115.
- [22] M. N. I. Opu et al., “Handwritten Bangla character recognition using convolutional neural networks: A comparative study and new lightweight model,” Neural Computing and Applications, vol. 36, no. 1, pp. 337–348, 2024.
- [23] M. T. Sabira et al., “Bengali handwritten character recognition enhanced with morphological preprocessing and attention mechanism-based convolutional neural network,” in Proc. Int. Conf. Computing Advancements, 2024, pp. 513–519.
- [24] M. Raquib et al., “VashaNet: An automated system for recognizing handwritten Bangla basic characters using deep convolutional neural network,” Machine Learning with Applications, vol. 17, p. 100568, 2024.
- [25] M. Raquib et al., “VashaNet-V2: Bangla handwritten character recognition using a novel deep convolutional neural network and an extended original dataset,” in Proc. Int. Conf. Machine Intelligence and Emerging Technologies, 2024, pp. 491–506.
- [26] S. Saha and M. Bhattacharya, “InkSynth: Recognizing Bengali compound characters with synthesized data and deep fusion networks,” International Journal on Document Analysis and Recognition, 2025.
- [27] T. S. B. Ahmed et al., “A vision transformer-based hybrid neural architecture for automated handwritten Bangla character recognition and braille conversion,” Knowledge-Based Systems, 2025.
- [28] M. A. Rahman et al., “Quantum-enhanced handwritten Bangla character recognition: A hybrid quantum classical neural network approach,” Research Square Preprint, 2026.
- [29] G. Deng and L. W. Cahill, “An adaptive Gaussian filter for noise reduction and edge detection,” in Proc. IEEE Nuclear Science Symposium, 1993, pp. 1615–1619.
- [30] L. Xu and E. Oja, “Randomized Hough transform (RHT): Basic mechanisms, algorithms, and computational complexities,” CVGIP: Image Understanding, vol. 57, no. 2, pp. 131–154, 1993.
- [31] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, 2019, pp. 6105–6114.
- [32] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
- [33] S. Woo et al., “CBAM: Convolutional block attention module,” in Proc. ECCV, 2018, pp. 3–19.
- [34] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020.
- [35] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
- [36] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv:1606.08415, 2016.
- [37] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450, 2016.
- [38] A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
- [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, 2015, pp. 448–456.
- [40] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs,” in Neurocomputing, Springer, 1990.
- [41] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
discussion (0)