pith · machine review for the scientific record

arxiv: 2605.07338 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords ShellfishNet · marine molluscs · underwater image benchmark · ecological monitoring · vision model evaluation · image corruption robustness · fine-grained visual categorization

The pith

ShellfishNet supplies a benchmark of 8,691 real underwater images across 32 mollusc taxa to test vision models for ecological monitoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built ShellfishNet because existing marine datasets do not reflect the variable lighting, postures, and degradations that occur in actual underwater settings where shellfish biodiversity is declining. They assembled the collection from field photos and web sources, added caption annotations to a subset, and ran systematic tests on 80 models spanning CNNs, transformers, state-space models, and self-supervised methods. They further measured robustness by applying simulated corruptions that mimic turbidity and severe weather. The work supplies both a data resource and an evaluation protocol so that model choices for practical benthic monitoring can rest on performance under conditions closer to deployment than generic benchmarks allow.
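The paper's corruption pipeline is not reproduced on this page. As a rough illustration of what a graded turbidity-style degradation can look like, here is a minimal sketch in the spirit of the common-corruptions protocol; the veil color, blending weight, and blur radius are illustrative assumptions, not the authors' parameters.

```python
# Minimal sketch of a turbidity-style corruption at graded severities.
# The murky-water veil color, blend strength, and blur radius below are
# illustrative assumptions, not the parameters used in the paper.
import numpy as np
from PIL import Image, ImageFilter

def simulate_turbidity(img: Image.Image, severity: int = 3) -> Image.Image:
    """Blend a green-blue 'water column' veil over the image, then blur it."""
    assert 1 <= severity <= 5
    alpha = 0.12 * severity                             # veil strength grows with severity
    veil = np.array([60, 110, 100], dtype=np.float32)   # assumed murky-water color (RGB)
    x = np.asarray(img.convert("RGB"), dtype=np.float32)
    x = (1.0 - alpha) * x + alpha * veil                # convex blend toward the veil color
    out = Image.fromarray(np.clip(x, 0, 255).astype(np.uint8))
    return out.filter(ImageFilter.GaussianBlur(radius=0.5 * severity))

# Usage: corrupted = simulate_turbidity(Image.open("scallop.jpg"), severity=4)
```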

Core claim

By releasing ShellfishNet, a dataset of 8,691 images spanning 32 taxa collected under real marine conditions, together with caption annotations and a set of corruption benchmarks that replicate common underwater degradations, the paper enables direct comparison of CNNs, Vision Transformers, state-space models, fine-grained visual categorization methods, and multimodal large language models on tasks that matter for automated ecological monitoring of benthic organisms.
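To make the comparison protocol concrete, the sketch below scores a few architectures from different families on a held-out split. The model names are stand-in examples from the timm library, and shellfishnet_val_loader is a hypothetical DataLoader; the paper's actual 80-model roster, checkpoints, and training setup are not reproduced here.

```python
# Sketch of a cross-architecture comparison of the kind the benchmark enables.
# Model names are illustrative timm identifiers, not the paper's model list.
import timm
import torch

candidates = ["resnet50", "vit_base_patch16_224", "convnext_tiny"]

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Fraction of validation images whose highest logit matches the label."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage; shellfishnet_val_loader is a stand-in for a loader over
# the 32-class validation split, not an artifact shipped with the paper.
# for name in candidates:
#     model = timm.create_model(name, pretrained=True, num_classes=32)
#     print(name, top1_accuracy(model, shellfishnet_val_loader))
```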

What carries the argument

The ShellfishNet dataset itself, whose construction from field and web images plus targeted corruption tests supplies the evaluation platform for model robustness in variable underwater environments.

If this is right

  • Architectures that maintain accuracy under the introduced corruption suite become preferred candidates for field-deployed monitoring systems.
  • Captioning performance of multimodal models on the annotated subset indicates which approaches can generate usable descriptions for automated ecological reports.
  • Fine-grained categorization results highlight the difficulty of distinguishing closely related taxa under realistic image conditions.
  • The benchmark supplies a common testbed that future work can use to measure incremental gains in underwater species recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dataset could be extended with temporal sequences or multi-view imagery to test tracking and 3-D recognition methods.
  • Integration with other coastal monitoring platforms might allow models trained on ShellfishNet to contribute to broader biodiversity dashboards.
  • If performance gaps persist across model families, the results would motivate collection of even larger domain-specific pre-training corpora from similar marine environments.

Load-bearing premise

The chosen field and web images plus the selected corruption tests already represent enough of the variability that real underwater monitoring will encounter, so that higher scores on this benchmark will predict better performance when models are deployed in the wild.

What would settle it

A new set of field videos from unrepresented locations or seasons in which the top-performing models on ShellfishNet show substantially lower accuracy than their benchmark scores would predict.

Figures

Figures reproduced from arXiv:2605.07338 by Chengliang Wu, Jun Yan, Nan Wang, Yang Wang, Ziheng Zhou.

Figure 1: Overview of the ShellfishNet dataset and its benchmarking framework. The workflow …

Figure 2: Distribution of taxa within the ShellfishNet dataset. The donut chart visualizes the …

Figure 3: Comparison of long-tail classification accuracy across various models. The bar chart …

Figure 4: Quantitative comparison of BERTScore-F1 and CLIP Score for the evaluated MLLMs.

Figure 5: Visual examples of the ShellfishNet data augmentation pipeline. The baseline raw image (a) …

Figure 6: Qualitative comparison for Babylonia areolata. Qwen2.5-7B suffers from visual hallucinations, falsely claiming the shells are empty and failing to recognize the specific bowl environment and culinary context captured by the ground truth (black bowl with a sprig of parsley). Despite its high quantitative semantic scores, Qwen2.5-7B exhibits clear visual hallucinations and generalization error …

Figure 7: Qualitative comparison for Argopecten purpuratus. The smaller Qwen2-2B model completely fails to follow the structured dimension constraints and produces a severe counting hallucination ("five shells in a 2x2 grid") … following and basic visual logic. It does not output the required four-dimensional structural format and instead collapses the response into a single superficial paragraph. Most notably, it …
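For orientation on the caption metrics in Figure 4 (standard definitions, not details drawn from the paper): BERTScore-F1 compares a generated caption against a reference caption via token-level similarity in BERT embedding space, while CLIPScore is reference-free, scoring a caption c directly against its image v from their CLIP text and image embeddings E_c and E_v:

\[ \mathrm{CLIPScore}(c, v) \;=\; 2.5 \cdot \max\big(\cos(E_c, E_v),\, 0\big) \]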
Original abstract

The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ShellfishNet, a benchmark dataset of 8,691 images spanning 32 marine mollusc taxa, constructed via field photography and web scraping to reflect real-world underwater conditions. It reports systematic evaluation of 80 models (CNNs, ViTs, SSMs, SSL methods), plus FGVC models and MLLM captioning, and introduces corruption tests simulating turbidity and severe weather to assess robustness for ecological monitoring applications.

Significance. If the dataset is publicly released with full annotations and the evaluations demonstrate clear performance patterns or robustness insights under realistic degradations, this could serve as a useful domain-specific resource for computer vision in marine ecology. The inclusion of caption annotations and corruption benchmarks is a positive step toward practical utility; however, the absence of any quantitative results, tables, or error analysis in the provided text limits assessment of whether the benchmark advances the state of the art.

major comments (2)
  1. [Abstract] The manuscript states that 80 models are systematically evaluated and that corruption tests assess robustness, yet no accuracy metrics, robustness scores, comparison tables, or key findings are reported. This omission is load-bearing for the central benchmarking claim, as the utility for 'trustworthy decisions on ecological protection' cannot be verified without evidence of model performance.
  2. [Dataset construction] The claim that field-collected and web-scraped images plus chosen corruptions 'sufficiently capture the full range of real-world underwater complexities' is asserted without supporting details on collection protocols, class balance, environmental metadata, or validation that the 32 taxa and 8,691 images generalize beyond the sampled conditions; this directly affects the weakest assumption identified in the review.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two headline quantitative results (e.g., best model accuracy, or the average robustness drop computed as in the sketch after this list) to give readers an immediate sense of the benchmark's outcomes.
  2. [Abstract] Clarify the size and usage of the 'curated subset annotated with descriptive captions': how many images, how annotations were generated, and whether they are used only for MLLM evaluation or also for other tasks.
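For concreteness, the "average robustness drop" suggested in minor comment 1 is just clean accuracy minus mean accuracy over the corruption suite. A toy calculation, with invented placeholder numbers rather than results from the paper:

```python
# Toy illustration of an "average robustness drop" headline number.
# Accuracies are invented placeholders, not results from the paper;
# the corruption names mirror the two scenarios the abstract mentions.
clean_acc = 0.90
corrupt_accs = {"turbidity": 0.74, "severe_weather": 0.69}
avg_drop = clean_acc - sum(corrupt_accs.values()) / len(corrupt_accs)
print(f"average robustness drop: {avg_drop:.3f}")  # -> 0.185
```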

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review of our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where appropriate to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract] The manuscript states that 80 models are systematically evaluated and that corruption tests assess robustness, yet no accuracy metrics, robustness scores, comparison tables, or key findings are reported. This omission is load-bearing for the central benchmarking claim, as the utility for 'trustworthy decisions on ecological protection' cannot be verified without evidence of model performance.

    Authors: We agree that the abstract would benefit from including key quantitative highlights to make the benchmarking claims more immediately verifiable. The full manuscript contains comprehensive results in the Experiments section, including accuracy tables for the 80 models across categories (CNNs, ViTs, SSMs, SSL), robustness scores under turbidity and weather corruptions, and comparisons with FGVC and MLLM approaches. In the revised version, we have updated the abstract to concisely report main findings, such as the range of top-1 accuracies observed and the general drop in performance under simulated degradations, while preserving brevity. revision: yes

  2. Referee: [Dataset construction] The claim that field-collected and web-scraped images plus chosen corruptions 'sufficiently capture the full range of real-world underwater complexities' is asserted without supporting details on collection protocols, class balance, environmental metadata, or validation that the 32 taxa and 8,691 images generalize beyond the sampled conditions; this directly affects the weakest assumption identified in the review.

    Authors: We acknowledge that the original dataset construction section lacked sufficient supporting details. The revised manuscript expands this section with explicit information on field photography protocols (including equipment, locations, and conditions), class distribution statistics showing balance across the 32 taxa, available environmental metadata, and a discussion of representativeness grounded in marine ecology references. We have also revised the phrasing of the claim to avoid overstatement, noting that the dataset targets key real-world challenges rather than claiming to capture the full range, and added a limitations paragraph on generalization. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical dataset construction and benchmarking effort with no mathematical derivations, equations, fitted parameters, or load-bearing self-citations. ShellfishNet is assembled via field collection and web scraping, then used to evaluate 80 models plus corruption tests; all reported results are direct measurements on the collected data rather than predictions derived from prior fitted quantities or self-referential definitions. The central claims rest on the existence and properties of the dataset itself, which is externally verifiable and not reduced to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is empirical data collection and model testing rather than new theory; relies on standard CV assumptions about supervised classification and robustness testing.

axioms (1)
  • domain assumption Neural network models trained on the provided images can be meaningfully evaluated for species recognition and robustness under simulated underwater degradations.
    Invoked when claiming the benchmark enables trustworthy decisions on ecological protection.

pith-pipeline@v0.9.0 · 5560 in / 1265 out tokens · 56260 ms · 2026-05-11T01:07:51.696067+00:00 · methodology

discussion (0)

