pith · machine review for the scientific record

arxiv: 2605.07338 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords ShellfishNet · marine molluscs · underwater image benchmark · ecological monitoring · vision model evaluation · image corruption robustness · fine-grained visual categorization

The pith

ShellfishNet supplies a benchmark of 8,691 real underwater images across 32 mollusc taxa to test vision models for ecological monitoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built ShellfishNet because existing marine datasets do not reflect the variable lighting, postures, and degradations that occur in actual underwater settings where shellfish biodiversity is declining. They assembled the collection from field photos and web sources, added caption annotations to a subset, and ran systematic tests on 80 models spanning CNNs, transformers, state-space models, and self-supervised methods. They further measured robustness by applying simulated corruptions that mimic turbidity and severe weather. The work supplies both a data resource and an evaluation protocol so that model choices for practical benthic monitoring can rest on performance under conditions closer to deployment than generic benchmarks allow.
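The paper's corruption pipeline is not reproduced on this page. As a rough illustration of what a graded turbidity-style degradation can look like, here is a minimal sketch in the spirit of the common-corruptions protocol; the veil color, blending weight, and blur radius are illustrative assumptions, not the authors' parameters.

```python
# Minimal sketch of a turbidity-style corruption at graded severities.
# The murky-water veil color, blend strength, and blur radius below are
# illustrative assumptions, not the parameters used in the paper.
import numpy as np
from PIL import Image, ImageFilter

def simulate_turbidity(img: Image.Image, severity: int = 3) -> Image.Image:
    """Blend a green-blue 'water column' veil over the image, then blur it."""
    assert 1 <= severity <= 5
    alpha = 0.12 * severity                             # veil strength grows with severity
    veil = np.array([60, 110, 100], dtype=np.float32)   # assumed murky-water color (RGB)
    x = np.asarray(img.convert("RGB"), dtype=np.float32)
    x = (1.0 - alpha) * x + alpha * veil                # convex blend toward the veil color
    out = Image.fromarray(np.clip(x, 0, 255).astype(np.uint8))
    return out.filter(ImageFilter.GaussianBlur(radius=0.5 * severity))

# Usage: corrupted = simulate_turbidity(Image.open("scallop.jpg"), severity=4)
```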

Core claim

By releasing ShellfishNet, a dataset of 8,691 images spanning 32 taxa collected under real marine conditions, together with caption annotations and a set of corruption benchmarks that replicate common underwater degradations, the paper enables direct comparison of CNNs, Vision Transformers, state-space models, fine-grained visual categorization methods, and multimodal large language models on tasks that matter for automated ecological monitoring of benthic organisms.
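To make the comparison protocol concrete, the sketch below scores a few architectures from different families on a held-out split. The model names are stand-in examples from the timm library, and shellfishnet_val_loader is a hypothetical DataLoader; the paper's actual 80-model roster, checkpoints, and training setup are not reproduced here.

```python
# Sketch of a cross-architecture comparison of the kind the benchmark enables.
# Model names are illustrative timm identifiers, not the paper's model list.
import timm
import torch

candidates = ["resnet50", "vit_base_patch16_224", "convnext_tiny"]

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Fraction of validation images whose highest logit matches the label."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage; shellfishnet_val_loader is a stand-in for a loader over
# the 32-class validation split, not an artifact shipped with the paper.
# for name in candidates:
#     model = timm.create_model(name, pretrained=True, num_classes=32)
#     print(name, top1_accuracy(model, shellfishnet_val_loader))
```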

What carries the argument

The ShellfishNet dataset itself, whose construction from field and web images plus targeted corruption tests supplies the evaluation platform for model robustness in variable underwater environments.

If this is right

  • Architectures that maintain accuracy under the introduced corruption suite become preferred candidates for field-deployed monitoring systems.
  • Captioning performance of multimodal models on the annotated subset indicates which approaches can generate usable descriptions for automated ecological reports.
  • Fine-grained categorization results highlight the difficulty of distinguishing closely related taxa under realistic image conditions.
  • The benchmark supplies a common testbed that future work can use to measure incremental gains in underwater species recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dataset could be extended with temporal sequences or multi-view imagery to test tracking and 3-D recognition methods.
  • Integration with other coastal monitoring platforms might allow models trained on ShellfishNet to contribute to broader biodiversity dashboards.
  • If performance gaps persist across model families, the results would motivate collection of even larger domain-specific pre-training corpora from similar marine environments.

Load-bearing premise

The chosen field and web images plus the selected corruption tests already represent enough of the variability that real underwater monitoring will encounter, so that higher scores on this benchmark will predict better performance when models are deployed in the wild.

What would settle it

A new set of field videos from unrepresented locations or seasons in which the top-performing models on ShellfishNet show substantially lower accuracy than their benchmark scores would predict.

Figures

Figures reproduced from arXiv:2605.07338 by Chengliang Wu, Jun Yan, Nan Wang, Yang Wang, Ziheng Zhou.

Figure 1: Overview of the ShellfishNet dataset and its benchmarking framework. The workflow …

Figure 2: Distribution of taxa within the ShellfishNet dataset. The donut chart visualizes the …

Figure 3: Comparison of long-tail classification accuracy across various models. The bar chart …

Figure 4: Quantitative comparison of BERTScore-F1 and CLIP Score for the evaluated MLLMs.

Figure 5: Visual examples of the ShellfishNet data augmentation pipeline. The baseline raw image (a) …

Figure 6: Qualitative comparison for Babylonia areolata. Qwen2.5-7B suffers from visual hallucinations, falsely claiming the shells are empty and failing to recognize the specific bowl environment and culinary context captured by the ground truth (black bowl with a sprig of parsley). Despite its high quantitative semantic scores, Qwen2.5-7B exhibits clear visual hallucinations and generalization error …

Figure 7: Qualitative comparison for Argopecten purpuratus. The smaller Qwen2-2B model completely fails to follow the structured dimension constraints and produces a severe counting hallucination ("five shells in a 2x2 grid") … following and basic visual logic. It does not output the required four-dimensional structural format and instead collapses the response into a single superficial paragraph. Most notably, it …
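For orientation on the caption metrics in Figure 4 (standard definitions, not details drawn from the paper): BERTScore-F1 compares a generated caption against a reference caption via token-level similarity in BERT embedding space, while CLIPScore is reference-free, scoring a caption c directly against its image v from their CLIP text and image embeddings E_c and E_v:

\[ \mathrm{CLIPScore}(c, v) \;=\; 2.5 \cdot \max\big(\cos(E_c, E_v),\, 0\big) \]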
Original abstract

The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ShellfishNet, a benchmark dataset of 8,691 images spanning 32 marine mollusc taxa, constructed via field photography and web scraping to reflect real-world underwater conditions. It reports systematic evaluation of 80 models (CNNs, ViTs, SSMs, SSL methods), plus FGVC models and MLLM captioning, and introduces corruption tests simulating turbidity and severe weather to assess robustness for ecological monitoring applications.

Significance. If the dataset is publicly released with full annotations and the evaluations demonstrate clear performance patterns or robustness insights under realistic degradations, this could serve as a useful domain-specific resource for computer vision in marine ecology. The inclusion of caption annotations and corruption benchmarks is a positive step toward practical utility; however, the absence of any quantitative results, tables, or error analysis in the provided text limits assessment of whether the benchmark advances the state of the art.

major comments (2)
  1. [Abstract] The manuscript states that 80 models are systematically evaluated and that corruption tests assess robustness, yet no accuracy metrics, robustness scores, comparison tables, or key findings are reported. This omission is load-bearing for the central benchmarking claim, as the utility for 'trustworthy decisions on ecological protection' cannot be verified without evidence of model performance.
  2. [Dataset construction] The claim that field-collected and web-scraped images plus chosen corruptions 'sufficiently capture the full range of real-world underwater complexities' is asserted without supporting details on collection protocols, class balance, environmental metadata, or validation that the 32 taxa and 8,691 images generalize beyond the sampled conditions; this directly affects the weakest assumption identified in the review.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two headline quantitative results (e.g., best model accuracy, or the average robustness drop computed as in the sketch after this list) to give readers an immediate sense of the benchmark's outcomes.
  2. [Abstract] Clarify the size and usage of the 'curated subset annotated with descriptive captions': how many images, how annotations were generated, and whether they are used only for MLLM evaluation or also for other tasks.
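For concreteness, the "average robustness drop" suggested in minor comment 1 is just clean accuracy minus mean accuracy over the corruption suite. A toy calculation, with invented placeholder numbers rather than results from the paper:

```python
# Toy illustration of an "average robustness drop" headline number.
# Accuracies are invented placeholders, not results from the paper;
# the corruption names mirror the two scenarios the abstract mentions.
clean_acc = 0.90
corrupt_accs = {"turbidity": 0.74, "severe_weather": 0.69}
avg_drop = clean_acc - sum(corrupt_accs.values()) / len(corrupt_accs)
print(f"average robustness drop: {avg_drop:.3f}")  # -> 0.185
```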

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review of our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where appropriate to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract] The manuscript states that 80 models are systematically evaluated and that corruption tests assess robustness, yet no accuracy metrics, robustness scores, comparison tables, or key findings are reported. This omission is load-bearing for the central benchmarking claim, as the utility for 'trustworthy decisions on ecological protection' cannot be verified without evidence of model performance.

    Authors: We agree that the abstract would benefit from including key quantitative highlights to make the benchmarking claims more immediately verifiable. The full manuscript contains comprehensive results in the Experiments section, including accuracy tables for the 80 models across categories (CNNs, ViTs, SSMs, SSL), robustness scores under turbidity and weather corruptions, and comparisons with FGVC and MLLM approaches. In the revised version, we have updated the abstract to concisely report main findings, such as the range of top-1 accuracies observed and the general drop in performance under simulated degradations, while preserving brevity. revision: yes

  2. Referee: [Dataset construction] The claim that field-collected and web-scraped images plus chosen corruptions 'sufficiently capture the full range of real-world underwater complexities' is asserted without supporting details on collection protocols, class balance, environmental metadata, or validation that the 32 taxa and 8,691 images generalize beyond the sampled conditions; this directly affects the weakest assumption identified in the review.

    Authors: We acknowledge that the original dataset construction section lacked sufficient supporting details. The revised manuscript expands this section with explicit information on field photography protocols (including equipment, locations, and conditions), class distribution statistics showing balance across the 32 taxa, available environmental metadata, and a discussion of representativeness grounded in marine ecology references. We have also revised the phrasing of the claim to avoid overstatement, noting that the dataset targets key real-world challenges rather than claiming to capture the full range, and added a limitations paragraph on generalization. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical dataset construction and benchmarking effort with no mathematical derivations, equations, fitted parameters, or load-bearing self-citations. ShellfishNet is assembled via field collection and web scraping, then used to evaluate 80 models plus corruption tests; all reported results are direct measurements on the collected data rather than predictions derived from prior fitted quantities or self-referential definitions. The central claims rest on the existence and properties of the dataset itself, which is externally verifiable and not reduced to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is empirical data collection and model testing rather than new theory; relies on standard CV assumptions about supervised classification and robustness testing.

axioms (1)
  • domain assumption Neural network models trained on the provided images can be meaningfully evaluated for species recognition and robustness under simulated underwater degradations.
    Invoked when claiming the benchmark enables trustworthy decisions on ecological protection.

pith-pipeline@v0.9.0 · 5560 in / 1265 out tokens · 56260 ms · 2026-05-11T01:07:51.696067+00:00 · methodology

discussion (0)

