pith. machine review for the scientific record.

arxiv: 2604.17278 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords pest recognition · vision-language model · multimodal learning · fine-grained classification · agricultural AI · RWKV architecture · chain-of-thought reasoning

The pith

PestVL-Net fuses a RWKV visual pathway with MLLM semantic descriptions to enable fine-grained multimodal pest recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PestVL-Net as a vision-language framework to overcome difficulties in recognizing the wide variety of pest species and their complex morphologies in real agricultural conditions. Its visual component employs the Recurrent Weighted Key Value architecture together with a saliency-guided adaptive window partitioning scheme, while the language component produces detailed pest descriptions by drawing on Multimodal Large Language Model priors shaped by agricultural expert knowledge and multimodal Chain-of-Thought reasoning. The central move is the deep fusion of these two complementary streams, which the authors argue produces more effective modeling of both low-level visual cues and high-level semantics than prior single-modality methods. Experiments across multiple pest datasets are presented to show that this integration yields stronger recognition performance with direct implications for practical pest management.
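The abstract does not specify the fusion operator, so the general dual-stream pattern can only be sketched. Below is a minimal, hypothetical illustration of late fusion between a visual feature vector (standing in for the RWKV encoder output) and a text embedding (standing in for an MLLM-generated description); all dimensions, weights, and function names are assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors to unit length before fusing,
    # so neither stream dominates by scale.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse(visual_feat, text_feat, w_v, w_t):
    # Project each stream into a shared 256-d space, then concatenate.
    # (Generic late fusion; the paper's "deep fusion" is not detailed
    # in the abstract.)
    v = l2_normalize(visual_feat @ w_v)
    t = l2_normalize(text_feat @ w_t)
    return np.concatenate([v, t], axis=-1)

visual = rng.normal(size=(1, 512))   # stand-in for RWKV visual features
text = rng.normal(size=(1, 768))     # stand-in for MLLM description embedding
w_v = rng.normal(size=(512, 256))    # hypothetical projection weights
w_t = rng.normal(size=(768, 256))

fused = fuse(visual, text, w_v, w_t)
print(fused.shape)  # (1, 512)
```

A classifier head over `fused` would then complete the recognition pipeline; whether the actual model fuses early, late, or via cross-attention is not determinable from the abstract.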

Core claim

PestVL-Net couples a Recurrent Weighted Key Value visual encoder, which applies saliency-guided adaptive window partitioning to capture fine-grained morphological details, with a linguistic pathway that generates precise pest semantic descriptions via Multimodal Large Language Models informed by agricultural expert knowledge and organized through multimodal Chain-of-Thought reasoning. The deep fusion of these visual and textual representations yields fine-grained multimodal pest learning that outperforms existing techniques on multiple pest datasets.

What carries the argument

The dual-pathway fusion in PestVL-Net, where saliency-guided RWKV visual encoding is combined with expert-informed MLLM textual generation structured by multimodal CoT reasoning.
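The expert-informed, CoT-structured caption generation can be pictured as a staged prompt fed to an MLLM. The sketch below is purely illustrative: the prompt wording, the step decomposition, and the `EXPERT_FACTS` entries are all invented for this example and do not appear in the paper.

```python
# Hypothetical expert knowledge base; the paper's actual agricultural
# expert annotations are not shown in the abstract.
EXPERT_FACTS = {
    "aphid": "soft-bodied, pear-shaped, often green; clusters on new growth",
}

def build_cot_prompt(species_hint, expert_facts):
    # Structure the description request as explicit reasoning steps,
    # mirroring the multimodal chain-of-thought idea.
    steps = [
        "Step 1: describe overall body shape and size.",
        "Step 2: describe color and texture of salient parts.",
        "Step 3: note distinguishing features versus similar species.",
        "Step 4: summarize into one fine-grained caption.",
    ]
    context = expert_facts.get(species_hint, "")
    return f"Expert knowledge: {context}\n" + "\n".join(steps) + "\nCaption:"

prompt = build_cot_prompt("aphid", EXPERT_FACTS)
print(prompt.splitlines()[0])  # Expert knowledge: soft-bodied, ...
```

The resulting caption would then be embedded and fused with the visual stream; the real system presumably conditions on the image as well, which a text-only sketch cannot show.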

If this is right

  • Better capture of diverse pest morphologies leads to higher recognition accuracy across multiple datasets.
  • Improved modeling supports more reliable pest identification in real-world agricultural settings.
  • The framework offers a practical route to scalable pest management that aids sustainable crop production.
  • Fusion of visual and semantic streams reduces the limitations of data scarcity typical in pest recognition tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same fusion strategy could be tested on other fine-grained recognition problems where expert textual knowledge is available, such as plant disease or wildlife species identification.
  • Deployment on edge devices for field use would require checking whether the MLLM component can be distilled without losing the CoT-derived semantic precision.
  • If the adaptive window scheme proves robust, it might generalize to other vision tasks involving small or variably scaled objects in cluttered scenes.

Load-bearing premise

The saliency-guided adaptive window partitioning together with MLLM priors drawn from agricultural experts and multimodal CoT reasoning will reliably extract the complex and varied morphological features of pests more effectively than earlier single-modality approaches.

What would settle it

A direct head-to-head test on a new pest dataset containing previously unseen species or morphologies in which PestVL-Net shows no accuracy gain over strong vision-only or language-only baselines would falsify the claim of superior fine-grained multimodal learning.

Figures

Figures reproduced from arXiv: 2604.17278 by Chengjun Xie, Huixin Zhang, Jie Zhang, Ke Cao, Rui Li, Runsheng Qi, Tao Hu, Xueheng Li.

Figure 1. The intricate morphology and texture of pests, com…
Figure 2. Pipeline for our PestVL-Net. The network comprises…
Figure 3. Overall framework of our proposed PestVL-Net, which comprises several main components: fine-grained visual feature model…
Figure 4. Saliency-guided window partitioning and local scanning.
Figure 5. The detailed computational structure of the Vision…
Figure 6. The process for generating pest captions leveraging agricultural expert knowledge and multimodal Chain-of-Thought reasoning…
Figure 7. Visualization of feature maps generated by distinct mod…
read the original abstract

Effective pest recognition and management are crucial for sustainable agricultural development. However, collecting pest data in real scenarios is often challenging. Compared to other domains, pests exhibit a wide variety of species with complex and diverse morphological characteristics. Existing techniques struggle to effectively model the key visual and high-level semantic features of pests in a fine-grained manner. These limitations hinder the practical application of such methods in real agricultural scenarios. To address these critical challenges, we present a synergistic approach that integrates PestVL-Net, a novel vision-language framework, with two multi-species pest datasets to facilitate fine-grained pest learning. The visual pathway of PestVL-Net utilizes the Recurrent Weighted Key Value (RWKV) architecture, incorporating a saliency-guided adaptive window partitioning scheme to effectively model the fine-grained visual characteristics of pests. Concurrently, the linguistic component generates precise pest semantic descriptions by leveraging Multimodal Large Language Models (MLLMs) priors, critically informed by agricultural expert knowledge and structured via multimodal Chain-of-Thought (CoT) reasoning. The deep fusion of these complementary visual and textual representations enables fine-grained multimodal pest learning. Extensive experimental evaluations on multiple pest datasets validate the superior performance of PestVL-Net, highlighting its potential for effective real-world pest management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes PestVL-Net, a novel vision-language framework for fine-grained multimodal pest learning in agriculture. The visual pathway employs an RWKV architecture augmented by a saliency-guided adaptive window partitioning scheme to model complex pest morphologies. The linguistic pathway generates semantic descriptions using Multimodal Large Language Models (MLLMs) informed by agricultural expert knowledge and structured via multimodal Chain-of-Thought (CoT) reasoning. These complementary representations are deeply fused, and the approach is paired with two multi-species pest datasets. The central claim is that this integration enables superior pest recognition performance, as validated by extensive experiments on multiple pest datasets.

Significance. If the performance claims hold, the work could meaningfully advance practical AI applications in sustainable agriculture by tackling data scarcity and morphological diversity in pest recognition. The combination of RWKV for efficient visual feature extraction with domain-expert-informed MLLM priors and CoT reasoning is a targeted adaptation that may outperform prior vision-only or generic multimodal baselines in this specialized domain. The release of associated datasets further supports reproducibility and future research.

minor comments (2)
  1. Abstract: the statement that 'extensive experimental evaluations on multiple pest datasets validate the superior performance' is not accompanied by any quantitative metrics, baseline names, dataset identifiers, or error bars. Adding at least the key accuracy or F1 scores and a brief comparison summary would make the central empirical claim more immediately assessable.
  2. [§3] The description of the 'saliency-guided adaptive window partitioning scheme' and its integration with RWKV would benefit from an explicit algorithmic outline or pseudocode (e.g., in §3.2) to clarify how window sizes are determined from saliency maps and how this differs from standard RWKV or ViT patching.
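To make the second comment concrete: one plausible shape such pseudocode could take is a quadtree-style rule in which windows with high mean saliency are subdivided into finer windows before scanning. This is a hypothetical sketch of the general idea, not the paper's algorithm; the base window size, threshold, and split rule are all assumed.

```python
import numpy as np

def saliency_guided_windows(saliency, base=8, thresh=0.5):
    # Quadtree-style split: windows whose mean saliency exceeds `thresh`
    # are subdivided into four finer windows; others stay coarse.
    # (Hypothetical — the paper's actual partitioning rule is not given
    # in the abstract.)
    H, W = saliency.shape
    windows = []
    for y in range(0, H, base):
        for x in range(0, W, base):
            block = saliency[y:y + base, x:x + base]
            if block.mean() > thresh:
                h = base // 2
                for dy in (0, h):
                    for dx in (0, h):
                        windows.append((y + dy, x + dx, h, h))  # fine window
            else:
                windows.append((y, x, base, base))              # coarse window
    return windows

sal = np.zeros((16, 16))
sal[:8, :8] = 1.0  # one salient quadrant (e.g. where the pest sits)
wins = saliency_guided_windows(sal)
print(len(wins))  # 7: four fine windows over the salient quadrant, three coarse
```

The RWKV encoder would then scan tokens within each window, allocating finer resolution to salient regions; whatever the paper's actual scheme is, an outline at roughly this level of detail would resolve the ambiguity.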

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The recognition of PestVL-Net's potential to advance practical AI applications in sustainable agriculture through efficient RWKV-based visual modeling, expert-informed MLLM priors, and multimodal CoT reasoning is appreciated. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces PestVL-Net as a novel architecture that fuses an RWKV-based visual pathway (with saliency-guided adaptive window partitioning) and MLLM-generated linguistic semantics (via expert-informed multimodal CoT reasoning). Its central claims rest on empirical validation across multiple pest datasets rather than any mathematical derivation, first-principles prediction, or fitted parameter that reduces to the inputs by construction. No self-definitional loops, renamed known results, or load-bearing self-citations appear in the abstract or high-level description; the performance superiority is presented as an experimental outcome, not an algebraic identity or tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Assessment limited to abstract; full details on parameters, assumptions, and entities unavailable. No explicit free parameters, axioms, or invented entities beyond the proposed model itself are described.

pith-pipeline@v0.9.0 · 5533 in / 1154 out tokens · 46810 ms · 2026-05-10T06:24:11.200773+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 9 canonical work pages · 4 internal anchors

  1. Jennifer A. Anderson, Peter C. Ellsworth, Josias C. Faria, Graham P. Head, Micheal D. K. Owen, Clinton D. Pilcher, Anthony M. Shelton, and Michael Meissle. Genetically engineered crops: importance of diversified integrated pest management for agricultural sustainability. Frontiers in Bioengineering and Biotechnology, 7:24, 2019.
  2. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  3. Liangyu Chen, Xin Lu, Jie Zhang, Xiaojie Chu, and Chengpeng Chen. HINet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 182–192, 2021.
  4. Emeric Courson et al. Weather and landscape drivers of the regional level of pest occurrence in arable agriculture: a multi-pest analysis at the French national scale. Agriculture, Ecosystems & Environment, 338:108105, 2022.
  5. Miaomiao Dai, Qianyu Zhou, and Lizhuang Ma. StyleRWKV: High-quality and high-efficiency style transfer with RWKV-like architecture. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.
  6. Jean-Philippe Deguine et al. Integrated pest management: good intentions, hard realities. A review. Agronomy for Sustainable Development, 41(3):38, 2021.
  7. Yuzhen Du, Teng Hu, Jiangning Zhang, Ran Yi, Chengming Xu, Xiaobin Hu, Kai Wu, Donghao Luo, Yabiao Wang, and Lizhuang Ma. Exploring real & synthetic dataset and linear attention in image restoration. arXiv preprint arXiv:2412.03814, 2024.
  8. Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, and Wenhai Wang. Vision-RWKV: Efficient and scalable visual perception with RWKV-like architectures. arXiv preprint arXiv:2403.02308, 2024.
  9. Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Eyriay, Lila Kari, Dirk Steinke, et al. BIOSCAN-5M: A multimodal dataset for insect biodiversity. Advances in Neural Information Processing Systems, 37:36285–36313, 2024.
  10. Cyril Goutte et al. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval, pages 345–359. Springer, 2005.
  11. Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, and Jiankang Deng. RWKV-CLIP: A robust vision-language representation learner. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4799–4812, 2024.
  12. Yuantao Han, Cong Zhang, Xiaoyun Zhan, Qiuxian Huang, and Zheng Wang. Crossing multiple life stages: fine-grained classification of agricultural pests. Plant Methods, 20(1):191, 2024.
  13. Kaiming He et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  14. Xiaodi Hou et al. Saliency detection: A spectral residual approach. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  15. Tao Hu, Jianming Du, Keyu Yan, Wei Dong, Jie Zhang, Jun Wang, and Chengjun Xie. Causality-inspired crop pest recognition based on decoupled feature learning. Pest Management Science, 80(11):5832–5842, 2024.
  16. Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  17. Tarun R. Jain. Dangerous farm insects dataset. https://www.kaggle.com/datasets/tarundalal/dangerous-insects-dataset/data, 2023. Accessed: 2025-07-18.
  18. Kairui Jin, Xing Zi, Karthick Thiyagarajan, Ali Braytee, and Mukesh Prasad. IP-VQA dataset: Empowering precision agriculture with autonomous insect pest management through visual question answering. In Companion Proceedings of the ACM on Web Conference 2025, pages 2000–2007, 2025.
  19. Kitaev et al. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  20. Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
  21. Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. UniFormer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.
  22. Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. UniFormer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12581–12600, 2023.
  23. Xueheng Li, Xuanhua He, Tao Hu, Jie Zhang, Man Zhou, Chengjun Xie, Yingying Wang, and Bo Huang. Freq-RWKV: Granularity-aware spatial-frequency synergy via dual-domain recurrent scanning for pan-sharpening. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 1890–1899, 2025.
  24. Yanfen Li, Hanxiang Wang, L. Minh Dang, Abolghasem Sadeghi-Niaraki, and Hyeonjoon Moon. Crop pest recognition in natural scenes using convolutional neural networks. Computers and Electronics in Agriculture, 169:105174, 2020.
  25. Bing Liu, Luyang Liu, Ran Zhuo, Weidong Chen, Rui Duan, and Guishen Wang. A dataset for forestry pest identification. Frontiers in Plant Science, 13:857104, 2022.
  26. Haiyun Liu et al. Auto-adjustment label assignment-based convolutional neural network for oriented wheat diseases detection. Computers and Electronics in Agriculture, 222:109029, 2024.
  27. Jun Liu et al. Plant diseases and pests detection based on deep learning: a review. Plant Methods, 17:1–18, 2021.
  28. Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. VMamba: Visual state space model. Advances in Neural Information Processing Systems, 37:103031–103063, 2024.
  29. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  30. Meng Lou, Shu Zhang, Hong-Yu Zhou, Sibei Yang, Chuan Wu, and Yizhou Yu. TransXNet: Learning both global and local dynamics with a dual dynamic token mixer for visual recognition. IEEE Transactions on Neural Networks and Learning Systems, 2025.
  31. Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in Neural Information Processing Systems, 29, 2016.
  32. Hoang-Quan Nguyen, Thanh-Dat Truong, Xuan Bac Nguyen, Ashley Dowling, Xin Li, and Khoa Luu. Insect-Foundation: A foundation model and large-scale 1M dataset for visual insect understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21945–21955, 2024.
  33. Mireille Gloria Founmilayo Odounfa, Charlemagne D. S. J. Gbemavo, Souand Peace Gloria Tahi, and Romain L. Glèlè Kakaï. Deep learning methods for enhanced stress and pest management in market garden crops: A comprehensive analysis. Smart Agricultural Technology, page 100521, 2024.
  34. Athanasios Passias, Karolos-Alexandros Tsakalos, Nick Rigogiannis, Dionisis Voglitsis, Nick Papanikolaou, Maria Michalopoulou, George Broufas, and Georgios Ch. Sirakoulis. Insect pest trap development and DL-based pest detection: A comprehensive review. IEEE Transactions on AgriFood Electronics, 2024.
  35. Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, 2023.
  36. Runsheng Qi, Rui Li, Jie Zhang, Yi Xia, Jianming Du, Jiahui Sun, Chengjun Xie, Hui Zhang, Guangyu Li, et al. A density-point network for dense tiny stored grain pest counting. Journal of Stored Products Research, 111:102536, 2025.
  37. Raijin. Agricultural pests image dataset. https://www.kaggle.com/datasets/vencerlanz09/agricultural-pests-image-dataset, 2023. Accessed: 2026-04-05.
  38. Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. AgroBench: Vision-language model benchmark in agriculture. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025.
  39. Simonyan et al. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  40. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  41. A. Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
  42. Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, and Li Zhang. SeaFormer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In The Eleventh International Conference on Learning Representations, 2023.
  43. Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, and Li Zhang. SeaFormer++: Squeeze-enhanced axial transformer for mobile visual recognition. International Journal of Computer Vision, pages 1–22, 2025.
  44. Haoyang Wang, Mingfang He, Minge Zhu, and Genhua Liu. WCG-VMamba: A multi-modal classification model for corn disease. Computers and Electronics in Agriculture, 230:109835, 2025.
  45. Qianning Wang, Chenglin Wang, Zhixin Lai, and Yucheng Zhou. InsectMamba: State space model with adaptive composite features for insect recognition. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
  46. Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang. IP102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8787–8796, 2019.
  47. Chengjun Xie, Jie Zhang, Rui Li, Jinyan Li, Peilin Hong, Junfeng Xia, and Peng Chen. Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning. Computers and Electronics in Agriculture, 119:123–132, 2015.
  48. Fei Xie, Weijia Zhang, Zhongdao Wang, and Chao Ma. QuadMamba: Learning quadtree-based selective scan for visual state space model. Advances in Neural Information Processing Systems, 37:117682–117707, 2024.
  49. Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
  50. Zhiwen Yang, Jiayin Li, Hui Zhang, Dan Zhao, Bingzheng Wei, and Yan Xu. Restore-RWKV: Efficient and effective medical image restoration with RWKV. IEEE Journal of Biomedical and Health Informatics, 2025.
  51. Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, and Chen Change Loy. Mamba or RWKV: Exploring high-quality and high-efficiency segment anything model. arXiv preprint arXiv:2406.19369, 2024.