ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
Pith reviewed 2026-05-11 01:07 UTC · model grok-4.3
The pith
ShellfishNet supplies a benchmark of 8,691 real-world images across 32 mollusc taxa for testing vision models used in ecological monitoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ShellfishNet is a dataset of 8,691 images spanning 32 taxa, collected under real marine conditions, together with caption annotations and a suite of corruption benchmarks that replicate common underwater degradations. By releasing it, the paper enables direct comparison of CNNs, Vision Transformers, state-space models, self-supervised learning methods, fine-grained visual categorization methods, and multimodal large language models on tasks that matter for automated ecological monitoring of benthic organisms.
What carries the argument
The ShellfishNet dataset itself: its construction from field and web images, combined with targeted corruption tests, supplies the evaluation platform for measuring model robustness in variable underwater environments.
If this is right
- Architectures that maintain accuracy under the introduced corruption suite become preferred candidates for field-deployed monitoring systems.
- Captioning performance of multimodal models on the annotated subset indicates which approaches can generate usable descriptions for automated ecological reports.
- Fine-grained categorization results highlight the difficulty of distinguishing closely related taxa under realistic image conditions.
- The benchmark supplies a common testbed that future work can use to measure incremental gains in underwater species recognition.
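The corruption suite referenced above targets degradations such as turbidity. As a rough illustration (not the paper's actual recipe), a turbidity-style corruption can be simulated by blending an image toward a greenish haze and flattening its contrast; the haze color and blending scheme below are assumptions for the sketch, assuming RGB arrays normalized to [0, 1]:

```python
import numpy as np

def add_turbidity(img: np.ndarray, severity: float = 0.5,
                  haze_color=(0.35, 0.55, 0.45)) -> np.ndarray:
    """Crude stand-in for underwater turbidity on an (H, W, 3) RGB
    image with values in [0, 1]. `severity` in [0, 1] controls how
    strongly the haze dominates; 0 returns the image unchanged."""
    haze = np.asarray(haze_color, dtype=img.dtype)
    # Alpha-blend every pixel toward the haze color.
    out = (1.0 - severity) * img + severity * haze
    # Pull values toward the per-channel mean to reduce contrast.
    mean = out.mean(axis=(0, 1), keepdims=True)
    out = mean + (1.0 - 0.5 * severity) * (out - mean)
    return np.clip(out, 0.0, 1.0)
```

Sweeping `severity` over a few levels and re-evaluating classifiers at each level is the usual way such a corruption is turned into a robustness benchmark.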
Where Pith is reading between the lines
- The dataset could be extended with temporal sequences or multi-view imagery to test tracking and 3-D recognition methods.
- Integration with other coastal monitoring platforms might allow models trained on ShellfishNet to contribute to broader biodiversity dashboards.
- If performance gaps persist across model families, the results would motivate collection of even larger domain-specific pre-training corpora from similar marine environments.
Load-bearing premise
The chosen field and web images plus the selected corruption tests already represent enough of the variability that real underwater monitoring will encounter, so that higher scores on this benchmark will predict better performance when models are deployed in the wild.
What would settle it
A new set of field videos from unrepresented locations or seasons in which the top-performing models on ShellfishNet show substantially lower accuracy than their benchmark scores would predict.
Original abstract
The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ShellfishNet, a benchmark dataset of 8,691 images spanning 32 marine mollusc taxa, constructed via field photography and web scraping to reflect real-world underwater conditions. It reports systematic evaluation of 80 models (CNNs, ViTs, SSMs, SSL methods), plus FGVC models and MLLM captioning, and introduces corruption tests simulating turbidity and severe weather to assess robustness for ecological monitoring applications.
Significance. If the dataset is publicly released with full annotations and the evaluations demonstrate clear performance patterns or robustness insights under realistic degradations, this could serve as a useful domain-specific resource for computer vision in marine ecology. The inclusion of caption annotations and corruption benchmarks is a positive step toward practical utility; however, the absence of any quantitative results, tables, or error analysis in the provided text limits assessment of whether the benchmark advances the state of the art.
Major comments (2)
- [Abstract] The manuscript states that 80 models are systematically evaluated and that corruption tests assess robustness, yet no accuracy metrics, robustness scores, comparison tables, or key findings are reported. This omission is load-bearing for the central benchmarking claim, as the utility for 'trustworthy decisions on ecological protection' cannot be verified without evidence of model performance.
- [Dataset construction] The claim that field-collected and web-scraped images plus chosen corruptions 'sufficiently capture the full range of real-world underwater complexities' is asserted without supporting details on collection protocols, class balance, environmental metadata, or validation that the 32 taxa and 8,691 images generalize beyond the sampled conditions; this directly affects the weakest assumption identified in the review.
Minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two headline quantitative results (e.g., best model accuracy or average robustness drop) to give readers an immediate sense of the benchmark's outcomes.
- [Abstract] Clarify the size and usage of the 'curated subset annotated with descriptive captions'—how many images, how annotations were generated, and whether they are used only for MLLM evaluation or also for other tasks.
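The "average robustness drop" suggested above could be computed in several ways; one common convention from the corruption-benchmark literature is the mean relative accuracy drop from the clean test set to each corrupted variant. This is an illustrative sketch, not necessarily the metric the paper uses:

```python
def robustness_drop(clean_acc: float, corrupted_accs: list[float]) -> float:
    """Mean relative accuracy drop across a corruption suite.
    0.0 means no degradation; 1.0 means accuracy collapsed to zero.
    One common convention, not necessarily the paper's metric."""
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    drops = [(clean_acc - a) / clean_acc for a in corrupted_accs]
    return sum(drops) / len(drops)

# Example: clean accuracy 0.90; accuracies under three corruptions.
score = robustness_drop(0.90, [0.72, 0.63, 0.81])  # ~0.2
```

Normalizing by clean accuracy makes the score comparable across models with different baseline performance, which is what a cross-architecture leaderboard needs.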
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed review of our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where appropriate to strengthen the paper.
Point-by-point responses
Referee: [Abstract] The manuscript states that 80 models are systematically evaluated and that corruption tests assess robustness, yet no accuracy metrics, robustness scores, comparison tables, or key findings are reported. This omission is load-bearing for the central benchmarking claim, as the utility for 'trustworthy decisions on ecological protection' cannot be verified without evidence of model performance.
Authors: We agree that the abstract would benefit from including key quantitative highlights to make the benchmarking claims more immediately verifiable. The full manuscript contains comprehensive results in the Experiments section, including accuracy tables for the 80 models across categories (CNNs, ViTs, SSMs, SSL), robustness scores under turbidity and weather corruptions, and comparisons with FGVC and MLLM approaches. In the revised version, we have updated the abstract to concisely report main findings, such as the range of top-1 accuracies observed and the general drop in performance under simulated degradations, while preserving brevity. revision: yes
Referee: [Dataset construction] The claim that field-collected and web-scraped images plus chosen corruptions 'sufficiently capture the full range of real-world underwater complexities' is asserted without supporting details on collection protocols, class balance, environmental metadata, or validation that the 32 taxa and 8,691 images generalize beyond the sampled conditions; this directly affects the weakest assumption identified in the review.
Authors: We acknowledge that the original dataset construction section lacked sufficient supporting details. The revised manuscript expands this section with explicit information on field photography protocols (including equipment, locations, and conditions), class distribution statistics showing balance across the 32 taxa, available environmental metadata, and a discussion of representativeness grounded in marine ecology references. We have also revised the phrasing of the claim to avoid overstatement, noting that the dataset targets key real-world challenges rather than claiming to capture the full range, and added a limitations paragraph on generalization. revision: yes
Circularity Check
No significant circularity
Rationale
The paper is an empirical dataset construction and benchmarking effort with no mathematical derivations, equations, fitted parameters, or load-bearing self-citations. ShellfishNet is assembled via field collection and web scraping, then used to evaluate 80 models plus corruption tests; all reported results are direct measurements on the collected data rather than predictions derived from prior fitted quantities or self-referential definitions. The central claims rest on the existence and properties of the dataset itself, which is externally verifiable and not reduced to its own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Neural network models trained on the provided images can be meaningfully evaluated for species recognition and robustness under simulated underwater degradations.