Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation
Pith reviewed 2026-05-07 17:01 UTC · model grok-4.3
The pith
RDCNet adds random dilated convolutions, multi-branch feature extraction, and context excitation to ResNet-34 to raise image classification accuracy on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present RDCNet, which integrates a Multi-Branch Random Dilated Convolution module that applies varying dilation rates with stochastic masking, a Fine-Grained Feature Enhancement module that links global and local representations via adaptive pooling and bilinear interpolation, and a Context Excitation module that performs softmax-based spatial and channel attention. This combination, built on ResNet-34, produces state-of-the-art accuracies with reported margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56% over the second-best methods on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof respectively.
What carries the argument
The RDCNet architecture centers on the MRDC module, which uses parallel dilated convolutions with random masking to extract robust multi-scale features, augmented by FGFE for scale bridging and CE for dynamic feature recalibration.
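The paper's code is not shown here, but the MRDC mechanism described above can be sketched in NumPy as a reading aid. The 3x3 averaging kernel, the dilation rates (1, 2, 3), the 10% mask probability, and fusion by averaging are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dilated_conv1ch(x, kernel, dilation):
    """Single-channel 2D convolution with a dilated 3x3 kernel (zero padding)."""
    pad = dilation  # pad = dilation keeps the output the same size for a 3x3 kernel
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def mrdc_sketch(x, dilations=(1, 2, 3), mask_prob=0.1, rng=None):
    """Hypothetical MRDC forward pass: parallel dilated branches, each
    stochastically masked, fused by averaging (all choices assumed)."""
    rng = rng or np.random.default_rng(0)
    kernel = np.full((3, 3), 1.0 / 9.0)  # stand-in for learned weights
    branches = []
    for d in dilations:
        y = dilated_conv1ch(x, kernel, d)
        mask = rng.random(y.shape) >= mask_prob  # random masking of activations
        branches.append(y * mask)
    return np.mean(branches, axis=0)

x = np.arange(36, dtype=float).reshape(6, 6)
y = mrdc_sketch(x)
print(y.shape)  # (6, 6)
```

The point of the sketch is that each branch sees the same input at a different effective receptive field, and the random mask injects noise-robustness in the style of dropout.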
If this is right
- The stochastic masking inside MRDC provides built-in robustness to noise and reduces overfitting risk during multi-scale feature learning.
- FGFE enables the network to emphasize subtle visual patterns by explicitly mixing pooled global context with local details.
- CE dynamically down-weights background interference, improving focus on task-relevant image regions without extra supervision.
- The full combination generalizes across datasets that differ in resolution, class count, and noise characteristics.
- Similar module insertions could be tested on other backbone networks to check whether the gains transfer beyond ResNet-34.
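The FGFE idea in the second bullet (pooled global context bridged back to local features via adaptive pooling and bilinear interpolation) can be sketched as follows; the pool size, the hand-rolled bilinear routine, and the additive fusion are assumptions, not the paper's actual design.

```python
import numpy as np

def adaptive_avg_pool(x, out_size):
    """Average-pool a 2D map down to an (out_size x out_size) grid."""
    h, w = x.shape
    pooled = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r0, r1 = i * h // out_size, (i + 1) * h // out_size
            c0, c1 = j * w // out_size, (j + 1) * w // out_size
            pooled[i, j] = x[r0:r1, c0:c1].mean()
    return pooled

def bilinear_upsample(x, out_h, out_w):
    """Bilinear interpolation back up to (out_h, out_w)."""
    h, w = x.shape
    rows = np.linspace(0, h - 1, out_h)
    cols = np.linspace(0, w - 1, out_w)
    out = np.empty((out_h, out_w))
    for a, r in enumerate(rows):
        r0 = int(np.floor(r)); r1 = min(r0 + 1, h - 1); fr = r - r0
        for b, c in enumerate(cols):
            c0 = int(np.floor(c)); c1 = min(c0 + 1, w - 1); fc = c - c0
            top = (1 - fc) * x[r0, c0] + fc * x[r0, c1]
            bot = (1 - fc) * x[r1, c0] + fc * x[r1, c1]
            out[a, b] = (1 - fr) * top + fr * bot
    return out

def fgfe_sketch(local, pool_size=2):
    """Hypothetical FGFE: pooled global context, upsampled and added to local features."""
    global_ctx = bilinear_upsample(adaptive_avg_pool(local, pool_size), *local.shape)
    return local + global_ctx

feat = np.arange(16, dtype=float).reshape(4, 4)
out = fgfe_sketch(feat)
print(out.shape)  # (4, 4)
```

Whatever the exact fusion rule, the key property is that every local position receives a smoothed summary of the whole map, which is what "bridging global and local representations" amounts to.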
Where Pith is reading between the lines
- The random-masking component might function as a lightweight regularizer that could be detached and inserted into other dilated-convolution designs for added stability.
- These modules may help in downstream tasks such as object detection or segmentation where both multi-scale context and noise suppression matter.
- Even modest benchmark gains could become practically useful in domains with high label noise or variable imaging conditions.
- Validation on larger-scale datasets would reveal whether the observed benefits remain consistent when data volume increases substantially.
Load-bearing premise
The reported accuracy margins arise from the three added modules rather than from any differences in training schedule, hyperparameter choices, or random seed effects.
What would settle it
Retraining the baseline methods under exactly the same protocol and hyperparameters as RDCNet and finding the accuracy gaps disappear, or ablating the MRDC, FGFE, and CE modules individually and observing that performance falls to or below the prior best results.
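The ablation protocol described above could be organized roughly as in this sketch. The `train_and_eval` hook, the seed list, and the placeholder accuracies are hypothetical scaffolding, not results from the paper.

```python
from itertools import combinations

MODULES = ("MRDC", "FGFE", "CE")

def ablation_grid():
    """Enumerate every subset of the three modules, from the plain
    ResNet-34 baseline up to the full RDCNet."""
    for k in range(len(MODULES) + 1):
        for subset in combinations(MODULES, k):
            yield subset

def run_ablation(train_and_eval, seeds=(0, 1, 2)):
    """Train each variant under one fixed protocol with several seeds and
    report per-variant mean accuracy, so that gaps are attributable to the
    modules rather than to training noise."""
    results = {}
    for subset in ablation_grid():
        accs = [train_and_eval(subset, seed) for seed in seeds]
        results[subset] = sum(accs) / len(accs)
    return results

# Stand-in evaluator for illustration only; a real one trains the model.
fake_eval = lambda subset, seed: 90.0 + len(subset)
table = run_ablation(fake_eval)
print(len(table))  # 8 variants: baseline plus 7 module combinations
```

If the full-combination row fails to beat the baseline and single-module rows under this controlled grid, the reported margins would be attributable to protocol differences rather than to the modules.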
original abstract
Image classification remains a fundamental yet challenging task in computer vision, particularly when fine-grained feature extraction and background noise suppression are required simultaneously. Conventional convolutional neural networks, despite their remarkable success in hierarchical feature learning, often struggle with capturing multi-scale contextual information and are susceptible to overfitting when confronted with noisy or irrelevant image regions. In this paper, we propose RDCNet (Image Classification Network with Random Dilated Convolution), a novel architecture built upon ResNet-34 that integrates three synergistic innovations to address these limitations: (1) a Multi-Branch Random Dilated Convolution (MRDC) module that employs parallel branches with varying dilation rates combined with a stochastic masking mechanism to capture fine-grained features across multiple scales while enhancing robustness against noise and overfitting; (2) a Fine-Grained Feature Enhancement (FGFE) module embedded within MRDC that bridges global contextual information with local feature representations through adaptive pooling and bilinear interpolation, thereby amplifying sensitivity to subtle visual patterns; and (3) a Context Excitation (CE) module that leverages softmax-based spatial attention and channel recalibration to dynamically emphasize task-relevant features while suppressing background interference. Extensive experiments conducted on five benchmark datasets -- CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof -- demonstrate that RDCNet consistently achieves state-of-the-art classification accuracy, outperforming the second-best competing methods by margins of 0.02\%, 1.12\%, 0.18\%, 4.73\%, and 3.56\%, respectively, thereby validating the effectiveness and generalizability of the proposed approach across diverse visual recognition scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce RDCNet, an image classification network built on ResNet-34, featuring a Multi-Branch Random Dilated Convolution (MRDC) module with parallel branches of varying dilation rates and stochastic masking, a Fine-Grained Feature Enhancement (FGFE) module using adaptive pooling and bilinear interpolation, and a Context Excitation (CE) module with softmax-based spatial attention and channel recalibration. Extensive experiments on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof are said to show state-of-the-art results, with RDCNet outperforming the second-best methods by 0.02%, 1.12%, 0.18%, 4.73%, and 3.56% respectively.
Significance. If the reported accuracy improvements hold under controlled conditions, the work would represent a modest incremental advance in CNN design for image classification by combining random multi-scale dilation with feature enhancement and attention mechanisms to better capture fine-grained details while mitigating background noise. The larger gains on Imagenette and Imagewoof suggest potential utility on fine-grained datasets, though the tiny margin on CIFAR-10 limits the overall impact.
major comments (1)
- [Abstract] The central claim of state-of-the-art performance with specific margins (0.02% on CIFAR-10 up to 4.73% on Imagenette) is presented without any experimental details, ablation studies (e.g., ResNet-34 + MRDC only vs. full RDCNet), baseline descriptions, hyperparameter settings, data augmentation protocols, or multi-run statistics with error bars. This omission is load-bearing for the central claim because it prevents verification that the gains are caused by the MRDC/FGFE/CE modules rather than by differences in training or random seeds.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We agree that the abstract is highly condensed and could better contextualize the experimental claims to facilitate verification. We address this point below and will make targeted revisions to improve transparency while preserving the abstract's brevity.
point-by-point responses
Referee: [Abstract] The central claim of state-of-the-art performance with specific margins (0.02% on CIFAR-10 up to 4.73% on Imagenette) is presented without any experimental details, ablation studies (e.g., ResNet-34 + MRDC only vs. full RDCNet), baseline descriptions, hyperparameter settings, data augmentation protocols, or multi-run statistics with error bars. This omission is load-bearing for the central claim because it prevents verification that the gains are caused by the MRDC/FGFE/CE modules rather than by differences in training or random seeds.
Authors: We acknowledge that the abstract, constrained by length, does not enumerate these details. However, the manuscript body provides them comprehensively: Section 4 ('Experiments') specifies all baselines (including ResNet-34 and recent SOTA methods), hyperparameter settings, data augmentation protocols (standard random cropping, horizontal flipping, and normalization), and training procedures. Section 5.3 ('Ablation Studies') directly compares ResNet-34, ResNet-34+MRDC, ResNet-34+MRDC+FGFE, and the full RDCNet to isolate module contributions. To strengthen verifiability, we will revise the abstract to include a concise clause such as 'under standard training protocols with ablations confirming module contributions (see Sections 4 and 5)' and add multi-run statistics with error bars (from 3 independent runs) to the main result tables in the revised manuscript. These changes will confirm that reported gains arise from the proposed MRDC, FGFE, and CE components rather than training variations.
Revision: partial
Circularity Check
No significant circularity: entirely empirical architecture with no derivation chain or fitted predictions
full rationale
The paper presents an empirical CNN architecture (RDCNet on ResNet-34 backbone) with three proposed modules (MRDC, FGFE, CE) and reports classification accuracies on five datasets. No equations, derivations, uniqueness theorems, or mathematical predictions appear in the provided abstract or description. The central claim is a set of observed accuracy margins, which are experimental outcomes rather than results derived from prior steps within the paper. There are no self-citations invoked as load-bearing premises, no ansatzes smuggled via citation, and no renaming of known results as new derivations. The work is self-contained as a standard empirical contribution whose validity rests on experimental controls (which the skeptic correctly notes are not detailed here), not on any internal reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Masoud Abdi and Saeid Nahavandi. Multi-residual networks: Improving the speed and accuracy of residual networks. arXiv preprint arXiv:1609.05672, 2017.
- [2] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, pages 1971–1980, 2019.
- [3] Linwei Chen, Lin Gu, Dezhi Zheng, and Ying Fu. Frequency-adaptive dilated convolution for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3414–3425, 2024.
- [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
- [5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation, 2017.
- [6] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- [7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
- [8] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout, 2017.
- [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- [10] Xingjian Fei, Jinghui Lu, Qian Sun, Hao Feng, Yanjie Wang, Wenqiang Shi, Anlong Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025.
- [11] Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, and Can Huang. DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. Science China Information Sciences, 2023.
- [12] Hao Feng, Wenqiang Shi, Kaijie Zhang, Xingjian Fei, Lei Liao, Dingkun Yang, Yibo Du, Xiao Wu, Jingqun Tang, Yuliang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026.
- [13] Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. arXiv preprint arXiv:2308.11592, 2023.
- [14] Hao Feng, Shuai Wei, Xingjian Fei, Wenqiang Shi, Yuechen Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21919–21936, 2025.
- [15] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1577–1586, 2020.
- [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
- [19] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2261–2269, 2017.
- [20] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
- [21] Wentao Jiang, Chen Chen, and Shengchong Zhang. Sparse feature image classification network with spatial position correction. Opto-Electronic Engineering, 51(5):240050, 2024.
- [22] Wentao Jiang, Linlin Zhao, and Chao Tu. Double-branch multi-attention mechanism based sharpness-aware classification network. Pattern Recognition and Artificial Intelligence, 36(3):252–267, 2023.
- [23] Xueyang Jiang, Xiaohan Zhang, Nan Gao, and Yue Deng. When fast Fourier transform meets transformer for image restoration. In Proceedings of the European Conference on Computer Vision, pages 381–402. Springer, 2025.
- [24] Dimitrios Konstantinidis, Ilias Papastratis, Kosmas Dimitropoulos, and Petros Daras. Multi-manifold attention for vision transformers. IEEE Access, 11:123433–123444, 2023.
- [25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
- [26] Hao Lan, Xiaohu Wang, Hao Shen, Pengda Liang, and Xian Wei. Couplformer: Rethinking vision transformer with coupling attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023.
- [27] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
- [28] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision, pages 404–419. Springer, 2018.
- [29] Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, et al. SPTS v2: Single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12), 2023.
- [30] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [31] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
- [32] Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding. Findings of the Association for Computational Linguistics: ACL 2025, pages 7252–7273, 2025.
- [33] Zhibo Luo, Zhitao Sun, Weilun Zhou, Zhengzhong Wu, and Sei-ichiro Kamata. Rethinking ResNets: Improved stacking strategies with high-order schemes for image classification. Complex and Intelligent Systems, 8(4):3395–3407, 2022.
- [34] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, pages 116–131, 2018.
- [35] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- [36] Biluo Shan, Xingjian Fei, Wenqiang Shi, Anlong Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024.
- [37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2015.
- [38] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [39] Wenhao Sun, Benlei Cui, Jingqun Tang, and Xue-Mei Dong. Attentive eraser: Unleashing diffusion model's object removal potential via self-attention redirection guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
- [40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- [41] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2020.
- [42] Jingqun Tang, Wei Du, Bo Wang, Wengang Zhou, Songlin Mei, Tian Xue, Xin Xu, and Hao Zhang. Character recognition competition for street view shop signs. National Science Review, 10(6):nwad141, 2023.
- [43] Jingqun Tang, Chunhui Lin, Zhen Zhao, Shuai Wei, Binghong Wu, Qi Liu, Yue He, Keren Lu, Hao Feng, Yuliang Li, et al. TextSquare: Scaling up text-centric visual instruction tuning. arXiv preprint arXiv:2404.12803, 2024.
- [44] Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shuai Wei, Anlong Wang, Chunhui Lin, Hao Feng, Zhen Zhao, et al. MTVQA: Benchmarking multilingual text-centric visual question answering. Findings of the Association for Computational Linguistics: ACL 2025, pages 7748–7763, 2025.
- [45] Jingqun Tang, Wenming Qian, Luchuan Song, Xiena Dong, Lan Li, and Xiang Bai. Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. In Proceedings of the European Conference on Computer Vision, pages 233–248. Springer, 2022.
- [46] Jingqun Tang, Shaohua Qiao, Benlei Cui, Yuhang Ma, Sheng Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022.
- [47] Jingqun Tang, Wenqing Zhang, Hongye Liu, Min-Kuang Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022.
- [48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
- [49] Anlong Wang, Biluo Shan, Wenqiang Shi, Kun-Yu Lin, Xingjian Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, et al. ParGo: Bridging vision-language with partial and global views. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
- [50] Anlong Wang, Jingqun Tang, Lei Liao, Hao Feng, Qi Liu, Xingjian Fei, Jinghui Lu, Han Wang, Hao Liu, Yuliang Liu, et al. WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild? Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [51] Han Wang, Yongjie Ye, Bingqi Li, Yuhang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, and Can Huang. Vision as LoRA. arXiv preprint arXiv:2503.20680, 2025.
- [52] Jian Wang, Qiping Yang, Shiqiang Yang, Xiuli Chai, and Wenjie Zhang. Dual-path processing network for high-resolution salient object detection. Applied Intelligence, 52(10):12034–12048, 2022.
- [53] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
- [54] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pages 3–19, 2018.
- [55] Xidong Wu, Shangqian Gao, Zeyu Zhang, Zhenzhen Li, Runxue Bao, Yanfu Zhang, Xiaoqian Wang, and Heng Huang. Auto-train-once: Controller network guided automatic network pruning from scratch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16163–16173, 2024.
- [56] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2016.
- [57] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
- [58] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2017.
- [59] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
- [60] Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shuai Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wengang Zhou, et al. TabPedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems, 2024.
- [61] Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Hao Liu, Zitao Zhang, Xin Tan, Can Huang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [62] Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shuai Wei, Hao Liu, Xin Tan, Zitao Zhang, Can Huang, et al. Harmonizing visual text comprehension and generation. In Advances in Neural Information Processing Systems, 2024.
- [63] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7):13001–13008, 2020.
- [64] Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, and Yonghong Tian. QKFormer: Hierarchical spiking transformer using q-k attention. arXiv preprint arXiv:2403.16552, 2024.
- [65] Hao Zhu, Yuliang Liu, Xiao Wu, Anlong Wang, Hao Feng, Dingkun Yang, Chen Feng, Can Huang, Jingqun Tang, et al. TextPecker: Rewarding structural anomaly quantification for enhancing visual text rendering. arXiv preprint arXiv:2602.20903, 2026.
- [66] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 593–602, 2019.