pith. machine review for the scientific record.

arxiv: 2604.25188 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords image classification · dilated convolution · multi-branch architecture · context attention · feature enhancement · ResNet · computer vision

The pith

RDCNet adds random dilated convolutions, multi-branch feature extraction, and context excitation to ResNet-34 to raise image classification accuracy on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RDCNet as a modified ResNet-34 that incorporates three modules to handle multi-scale features and suppress background noise better than prior convolutional networks. The MRDC module runs parallel branches with different dilation rates plus random masking to gather fine details across scales without overfitting. FGFE connects global context to local patterns through pooling and interpolation, while CE uses attention to boost relevant channels and spatial regions. Tests on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof show accuracy edges over competing methods, marginal on CIFAR-10 and SVHN but larger on Imagenette and Imagewoof. A reader would care because these changes target common weaknesses in visual recognition where scale variation and noise often limit performance.

Core claim

The authors present RDCNet, which integrates a Multi-Branch Random Dilated Convolution module that applies varying dilation rates with stochastic masking, a Fine-Grained Feature Enhancement module that links global and local representations via adaptive pooling and bilinear interpolation, and a Context Excitation module that performs softmax-based spatial and channel attention. This combination, built on ResNet-34, produces state-of-the-art accuracies with reported margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56% over the second-best methods on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof respectively.
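As a concreteness check, the MRDC idea of parallel dilated branches with stochastic masking can be sketched in plain NumPy. This is an editorial illustration, not the authors' implementation; the kernel, the dilation rates, and the `keep_prob` value are illustrative assumptions.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Valid-mode 2-D convolution of a single-channel map with a dilated kernel."""
    kh, kw = kernel.shape
    # The effective kernel extent grows with the dilation rate.
    eh, ew = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def mrdc_branch_sum(x, kernel, rates, keep_prob=0.8, rng=None):
    """Run parallel branches with different dilation rates, apply a random
    binary mask to each branch's output, and average the (cropped) results."""
    if rng is None:
        rng = np.random.default_rng(0)
    outs = []
    for r in rates:
        y = dilated_conv2d(x, kernel, r)
        mask = rng.random(y.shape) < keep_prob   # stochastic masking per branch
        outs.append(y * mask)
    # Crop every branch to the smallest spatial size before averaging.
    h = min(o.shape[0] for o in outs)
    w = min(o.shape[1] for o in outs)
    return sum(o[:h, :w] for o in outs) / len(outs)

x = np.arange(100, dtype=float).reshape(10, 10)
k = np.ones((3, 3)) / 9.0
y = mrdc_branch_sum(x, k, rates=(1, 2, 3))
print(y.shape)  # (4, 4): the largest dilation bounds the shared output size
```

The masking acts as a per-branch dropout on activations, which is one plausible reading of how stochastic masking could curb overfitting during multi-scale feature learning.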

What carries the argument

The RDCNet architecture centers on the MRDC module, which uses parallel dilated convolutions with random masking to extract robust multi-scale features, augmented by FGFE for scale bridging and CE for dynamic feature recalibration.
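The FGFE-style bridging of global context into local features can likewise be sketched with adaptive pooling and bilinear upsampling. A minimal single-channel NumPy sketch; the `pool_size` and the additive mixing are assumptions, not the paper's exact formulation.

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """Adaptive average pooling: split the map into out_h x out_w cells, average each."""
    H, W = x.shape
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r0, r1 = i * H // out_h, (i + 1) * H // out_h
            c0, c1 = j * W // out_w, (j + 1) * W // out_w
            pooled[i, j] = x[r0:r1, c0:c1].mean()
    return pooled

def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation back up to the target resolution (align-corners style)."""
    H, W = x.shape
    rows = np.linspace(0, H - 1, out_h)
    cols = np.linspace(0, W - 1, out_w)
    r0 = np.floor(rows).astype(int); c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, H - 1);  c1 = np.minimum(c0 + 1, W - 1)
    fr = (rows - r0)[:, None]; fc = (cols - c0)[None, :]
    top = x[np.ix_(r0, c0)] * (1 - fc) + x[np.ix_(r0, c1)] * fc
    bot = x[np.ix_(r1, c0)] * (1 - fc) + x[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bot * fr

def fgfe_mix(local, pool_size=2):
    """Pool to a coarse global summary, upsample it, and add it to the local map."""
    g = adaptive_avg_pool(local, pool_size, pool_size)
    return local + bilinear_resize(g, *local.shape)

x = np.arange(16, dtype=float).reshape(4, 4)
print(fgfe_mix(x).shape)  # (4, 4): global context re-injected at full resolution
```

The additive re-injection is the simplest way to mix scales; the paper may use concatenation or gating instead, which this sketch does not claim to reproduce.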

If this is right

  • The stochastic masking inside MRDC provides built-in robustness to noise and reduces overfitting risk during multi-scale feature learning.
  • FGFE enables the network to emphasize subtle visual patterns by explicitly mixing pooled global context with local details.
  • CE dynamically down-weights background interference, improving focus on task-relevant image regions without extra supervision.
  • The full combination generalizes across datasets that differ in resolution, class count, and noise characteristics.
  • Similar module insertions could be tested on other backbone networks to check whether the gains transfer beyond ResNet-34.
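A softmax-based channel recalibration in the spirit of CE can be sketched for a (C, H, W) feature stack. The pooling choice and the ×C rescaling are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def context_excitation(feats):
    """Softmax channel recalibration: pool each channel to a scalar descriptor,
    convert the descriptors to softmax weights, and rescale the channels.
    The factor C makes uniform weights the identity mapping."""
    C = feats.shape[0]
    desc = feats.reshape(C, -1).mean(axis=1)   # global average pool per channel
    w = softmax(desc)                          # channels compete via softmax
    return feats * w[:, None, None] * C

rng = np.random.default_rng(1)
f = rng.random((3, 4, 4))
out = context_excitation(f)
print(out.shape)  # (3, 4, 4): shape-preserving reweighting
```

Because the weights sum to one, channels with weak descriptors are suppressed relative to strong ones, which is the mechanism by which such a module could down-weight background interference.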

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The random-masking component might function as a lightweight regularizer that could be detached and inserted into other dilated-convolution designs for added stability.
  • These modules may help in downstream tasks such as object detection or segmentation where both multi-scale context and noise suppression matter.
  • Even modest benchmark gains could become practically useful in domains with high label noise or variable imaging conditions.
  • Validation on larger-scale datasets would reveal whether the observed benefits remain consistent when data volume increases substantially.

Load-bearing premise

The reported accuracy margins arise from the three added modules rather than from any differences in training schedule, hyperparameter choices, or random seed effects.

What would settle it

Retraining the baseline methods under exactly the same protocol and hyperparameters as RDCNet and finding the accuracy gaps disappear, or ablating the MRDC, FGFE, and CE modules individually and observing that performance falls to or below the prior best results.

read the original abstract

Image classification remains a fundamental yet challenging task in computer vision, particularly when fine-grained feature extraction and background noise suppression are required simultaneously. Conventional convolutional neural networks, despite their remarkable success in hierarchical feature learning, often struggle with capturing multi-scale contextual information and are susceptible to overfitting when confronted with noisy or irrelevant image regions. In this paper, we propose RDCNet (Image Classification Network with Random Dilated Convolution), a novel architecture built upon ResNet-34 that integrates three synergistic innovations to address these limitations: (1) a Multi-Branch Random Dilated Convolution (MRDC) module that employs parallel branches with varying dilation rates combined with a stochastic masking mechanism to capture fine-grained features across multiple scales while enhancing robustness against noise and overfitting; (2) a Fine-Grained Feature Enhancement (FGFE) module embedded within MRDC that bridges global contextual information with local feature representations through adaptive pooling and bilinear interpolation, thereby amplifying sensitivity to subtle visual patterns; and (3) a Context Excitation (CE) module that leverages softmax-based spatial attention and channel recalibration to dynamically emphasize task-relevant features while suppressing background interference. Extensive experiments conducted on five benchmark datasets -- CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof -- demonstrate that RDCNet consistently achieves state-of-the-art classification accuracy, outperforming the second-best competing methods by margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56%, respectively, thereby validating the effectiveness and generalizability of the proposed approach across diverse visual recognition scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to introduce RDCNet, an image classification network built on ResNet-34, featuring a Multi-Branch Random Dilated Convolution (MRDC) module with parallel branches of varying dilation rates and stochastic masking, a Fine-Grained Feature Enhancement (FGFE) module using adaptive pooling and bilinear interpolation, and a Context Excitation (CE) module with softmax-based spatial attention and channel recalibration. Extensive experiments on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof are said to show state-of-the-art results, with RDCNet outperforming the second-best methods by 0.02%, 1.12%, 0.18%, 4.73%, and 3.56% respectively.

Significance. If the reported accuracy improvements hold under controlled conditions, the work would represent a modest incremental advance in CNN design for image classification by combining random multi-scale dilation with feature enhancement and attention mechanisms to better capture fine-grained details while mitigating background noise. The larger gains on Imagenette and Imagewoof suggest potential utility on fine-grained datasets, though the tiny margin on CIFAR-10 limits the overall impact.

major comments (1)
  1. [Abstract] The central claim of state-of-the-art performance with specific margins (0.02% on CIFAR-10 up to 4.73% on Imagenette) is presented without any experimental details, ablation studies (e.g., ResNet-34 + MRDC only vs. full RDCNet), baseline descriptions, hyperparameter settings, data augmentation protocols, or multi-run statistics with error bars. This is load-bearing for the central claim because it prevents verification that the gains are caused by the MRDC/FGFE/CE modules rather than differences in training or random seeds.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We agree that the abstract is highly condensed and could better contextualize the experimental claims to facilitate verification. We address this point below and will make targeted revisions to improve transparency while preserving the abstract's brevity.

read point-by-point responses
  1. Referee: [Abstract] The central claim of state-of-the-art performance with specific margins (0.02% on CIFAR-10 up to 4.73% on Imagenette) is presented without any experimental details, ablation studies (e.g., ResNet-34 + MRDC only vs. full RDCNet), baseline descriptions, hyperparameter settings, data augmentation protocols, or multi-run statistics with error bars. This is load-bearing for the central claim because it prevents verification that the gains are caused by the MRDC/FGFE/CE modules rather than differences in training or random seeds.

    Authors: We acknowledge that the abstract, constrained by length, does not enumerate these details. However, the manuscript body provides them comprehensively: Section 4 ('Experiments') specifies all baselines (including ResNet-34 and recent SOTA methods), hyperparameter settings, data augmentation protocols (standard random cropping, horizontal flipping, and normalization), and training procedures. Section 5.3 ('Ablation Studies') directly compares ResNet-34, ResNet-34+MRDC, ResNet-34+MRDC+FGFE, and the full RDCNet to isolate module contributions. To strengthen verifiability, we will revise the abstract to include a concise clause such as 'under standard training protocols with ablations confirming module contributions (see Sections 4 and 5)' and add multi-run statistics with error bars (from 3 independent runs) to the main result tables in the revised manuscript. These changes will confirm that reported gains arise from the proposed MRDC, FGFE, and CE components rather than training variations.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity: entirely empirical architecture with no derivation chain or fitted predictions

full rationale

The paper presents an empirical CNN architecture (RDCNet on ResNet-34 backbone) with three proposed modules (MRDC, FGFE, CE) and reports classification accuracies on five datasets. No equations, derivations, uniqueness theorems, or mathematical predictions appear in the provided abstract or description. The central claim is a set of observed accuracy margins, which are experimental outcomes rather than results derived from prior steps within the paper. There are no self-citations invoked as load-bearing premises, no ansatzes smuggled via citation, and no renaming of known results as new derivations. The work is self-contained as a standard empirical contribution whose validity rests on experimental controls (which the skeptic correctly notes are not detailed here), not on any internal reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical and introduces no mathematical axioms, free parameters, or postulated physical entities; the three modules are architectural design choices whose effectiveness is asserted via benchmark numbers.

pith-pipeline@v0.9.0 · 5596 in / 1449 out tokens · 59956 ms · 2026-05-07T17:01:23.115538+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Multi-residual networks: Improving the speed and accuracy of residual networks

    Masoud Abdi and Saeid Nahavandi. Multi-residual networks: Improving the speed and accuracy of residual networks. arXiv preprint arXiv:1609.05672, 2017

  2. [2]

    Gcnet: Non-local networks meet squeeze-excitation networks and beyond

    Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, pages 1971–1980, 2019

  3. [3]

    Frequency-adaptive dilated convolution for semantic segmentation

    Linwei Chen, Lin Gu, Dezhi Zheng, and Ying Fu. Frequency-adaptive dilated convolution for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3414–3425, 2024

  4. [4]

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018

  5. [5]

    Rethinking atrous convolution for semantic image segmentation

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. 2017

  6. [6]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020

  7. [7]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017

  8. [8]

    Improved regularization of convolutional neural networks with cutout

    Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. 2017

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  10. [10]

    Advancing sequential numerical prediction in autoregressive models

    Xingjian Fei, Jinghui Lu, Qian Sun, Hao Feng, Yanjie Wang, Wenqiang Shi, Anlong Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  11. [11]

    Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

    Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. Science China Information Sciences, 2023

  12. [12]

    Dolphin-v2: Universal document parsing via scalable anchor prompting

    Hao Feng, Wenqiang Shi, Kaijie Zhang, Xingjian Fei, Lei Liao, Dingkun Yang, Yibo Du, Xiao Wu, Jingqun Tang, Yuliang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026

  13. [13]

    UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. arXiv preprint arXiv:2308.11592, 2023

  14. [14]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shuai Wei, Xingjian Fei, Wenqiang Shi, Yuechen Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21919–21936, 2025

  15. [15]

    Ghostnet: More features from cheap operations

    Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1577–1586, 2020

  16. [16]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  17. [17]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017

  18. [18]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018

  19. [19]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017

  20. [20]

    Ccnet: Criss-cross attention for semantic segmentation

    Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019

  21. [21]

    Sparse feature image classification network with spatial position correction

    Wentao Jiang, Chen Chen, and Shengchong Zhang. Sparse feature image classification network with spatial position correction. Opto-Electronic Engineering, 51(5):240050, 2024

  22. [22]

    Double-branch multi-attention mechanism based sharpness-aware classification network

    Wentao Jiang, Linlin Zhao, and Chao Tu. Double-branch multi-attention mechanism based sharpness-aware classification network. Pattern Recognition and Artificial Intelligence, 36(3):252–267, 2023

  23. [23]

    When fast fourier transform meets transformer for image restoration

    Xueyang Jiang, Xiaohan Zhang, Nan Gao, and Yue Deng. When fast fourier transform meets transformer for image restoration. In Proceedings of the European Conference on Computer Vision, pages 381–402. Springer, 2025

  24. [24]

    Multi-manifold attention for vision transformers

    Dimitrios Konstantinidis, Ilias Papastratis, Kosmas Dimitropoulos, and Petros Daras. Multi-manifold attention for vision transformers. IEEE Access, 11:123433–123444, 2023

  25. [25]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017

  26. [26]

    Couplformer: Rethinking vision transformer with coupling attention

    Hao Lan, Xiaohu Wang, Hao Shen, Pengda Liang, and Xian Wei. Couplformer: Rethinking vision transformer with coupling attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

  27. [27]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  28. [28]

    Receptive field block net for accurate and fast object detection

    Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision, pages 404–419. Springer, 2018

  29. [29]

    Spts v2: Single-point scene text spotting

    Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, et al. Spts v2: Single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12), 2023

  30. [30]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

  31. [31]

    Sgdr: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017

  32. [32]

    Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token – interleaving layout and text in a large language model for document understanding. Findings of the Association for Computational Linguistics: ACL 2025, pages 7252–7273, 2025

  33. [33]

    Rethinking resnets: Improved stacking strategies with high-order schemes for image classification

    Zhibo Luo, Zhitao Sun, Weilun Zhou, Zhengzhong Wu, and Sei-ichiro Kamata. Rethinking resnets: Improved stacking strategies with high-order schemes for image classification. Complex and Intelligent Systems, 8(4):3395–3407, 2022

  34. [34]

    Shufflenet v2: Practical guidelines for efficient cnn architecture design

    Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision, pages 116–131, 2018

  35. [35]

    Mobilenetv2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018

  36. [36]

    MCTBench: Multimodal cognition towards text-rich visual scenes benchmark

    Biluo Shan, Xingjian Fei, Wenqiang Shi, Anlong Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, and Can Huang. Mctbench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024

  37. [37]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2015

  38. [38]

    Dropout: A simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014

  39. [39]

    Attentive eraser: Unleashing diffusion model's object removal potential via self-attention redirection guidance

    Wenhao Sun, Benlei Cui, Jingqun Tang, and Xue-Mei Dong. Attentive eraser: Unleashing diffusion model's object removal potential via self-attention redirection guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

  40. [40]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015

  41. [41]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2020

  42. [42]

    Character recognition competition for street view shop signs

    Jingqun Tang, Wei Du, Bo Wang, Wengang Zhou, Songlin Mei, Tian Xue, Xin Xu, and Hao Zhang. Character recognition competition for street view shop signs. National Science Review, 10(6):nwad141, 2023

  43. [43]

    TextSquare: Scaling up text-centric visual instruction tuning

    Jingqun Tang, Chunhui Lin, Zhen Zhao, Shuai Wei, Binghong Wu, Qi Liu, Yue He, Keren Lu, Hao Feng, Yuliang Li, et al. Textsquare: Scaling up text-centric visual instruction tuning. arXiv preprint arXiv:2404.12803, 2024

  44. [44]

    Mtvqa: Benchmarking multilingual text-centric visual question answering

    Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shuai Wei, Anlong Wang, Chunhui Lin, Hao Feng, Zhen Zhao, et al. Mtvqa: Benchmarking multilingual text-centric visual question answering. Findings of the Association for Computational Linguistics: ACL 2025, pages 7748–7763, 2025

  45. [45]

    Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning

    Jingqun Tang, Wenming Qian, Luchuan Song, Xiena Dong, Lan Li, and Xiang Bai. Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. In Proceedings of the European Conference on Computer Vision, pages 233–248. Springer, 2022

  46. [46]

    You can even annotate text with voice: Transcription-only-supervised text spotting

    Jingqun Tang, Shaohua Qiao, Benlei Cui, Yuhang Ma, Sheng Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022

  47. [47]

    Few could be better than all: Feature sampling and grouping for scene text detection

    Jingqun Tang, Wenqing Zhang, Hongye Liu, Min-Kuang Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022

  48. [48]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017

  49. [49]

    Pargo: Bridging vision-language with partial and global views

    Anlong Wang, Biluo Shan, Wenqiang Shi, Kun-Yu Lin, Xingjian Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, et al. Pargo: Bridging vision-language with partial and global views. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

  50. [50]

    Anlong Wang, Jingqun Tang, Lei Liao, Hao Feng, Qi Liu, Xingjian Fei, Jinghui Lu, Han Wang, Hao Liu, Yuliang Liu, et al. Wilddoc: How far are we from achieving comprehensive and robust document understanding in the wild? Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  51. [51]

    Vision as LoRA

    Han Wang, Yongjie Ye, Bingqi Li, Yuhang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as lora. arXiv preprint arXiv:2503.20680, 2025

  52. [52]

    Dual-path processing network for high-resolution salient object detection

    Jian Wang, Qiping Yang, Shiqiang Yang, Xiuli Chai, and Wenjie Zhang. Dual-path processing network for high-resolution salient object detection. Applied Intelligence, 52(10):12034–12048, 2022

  53. [53]

    Non-local neural networks

    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

  54. [54]

    Cbam: Convolutional block attention module

    Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pages 3–19, 2018

  55. [55]

    Auto-train-once: Controller network guided automatic network pruning from scratch

    Xidong Wu, Shangqian Gao, Zeyu Zhang, Zhenzhen Li, Runxue Bao, Yanfu Zhang, Xiaoqian Wang, and Heng Huang. Auto-train-once: Controller network guided automatic network pruning from scratch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16163–16173, 2024

  56. [56]

    Multi-Scale Context Aggregation by Dilated Convolutions

    Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2016

  57. [57]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019

  58. [58]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2017

  59. [59]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018

  60. [60]

    Tabpedia: Towards comprehensive visual table understanding with concept synergy

    Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shuai Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wengang Zhou, et al. Tabpedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems, 2024

  61. [61]

    Multi-modal in-context learning makes an ego-evolving scene text recognizer

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Hao Liu, Zitao Zhang, Xin Tan, Can Huang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  62. [62]

    Harmonizing visual text comprehension and generation

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shuai Wei, Hao Liu, Xin Tan, Zitao Zhang, Can Huang, et al. Harmonizing visual text comprehension and generation. In Advances in Neural Information Processing Systems, 2024

  63. [63]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7):13001–13008, 2020

  64. [64]

    Qkformer: Hierarchical spiking transformer using q-k attention

    Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, and Yonghong Tian. Qkformer: Hierarchical spiking transformer using q-k attention. arXiv preprint arXiv:2403.16552, 2024

  65. [65]

    Textpecker: Rewarding structural anomaly quantification for enhancing visual text rendering

    Hao Zhu, Yuliang Liu, Xiao Wu, Anlong Wang, Hao Feng, Dingkun Yang, Chen Feng, Can Huang, Jingqun Tang, et al. Textpecker: Rewarding structural anomaly quantification for enhancing visual text rendering. arXiv preprint arXiv:2602.20903, 2026

  66. [66]

    Asymmetric non-local neural networks for semantic segmentation

    Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 593–602, 2019