Can Multimodal Large Language Models Truly Understand Small Objects?
Pith reviewed 2026-05-08 12:45 UTC · model grok-4.3
The pith
Multimodal large language models exhibit weak capabilities in understanding small objects within images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that state-of-the-art multimodal large language models display limited ability to understand small objects. They demonstrate this through SOUBench by testing 15 models on the SOU-VQA dataset, which covers multiple real-world contexts. They further show that supervised fine-tuning on the companion SOU-Train dataset measurably strengthens these capabilities.
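The benchmark numbers behind this claim reduce to a simple aggregation: per-model accuracy broken down by scenario and sub-task. A minimal sketch, assuming exact-match scoring and a hypothetical record schema; the paper's actual evaluation tooling and metric may differ:

```python
from collections import defaultdict

def subtask_accuracy(records):
    """Exact-match accuracy per (scenario, sub-task) pair.

    records: dicts with 'scenario', 'subtask', 'prediction', and
    'answer' fields (a hypothetical schema, not the paper's format).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["scenario"], r["subtask"])
        totals[key] += 1
        # Case- and whitespace-insensitive exact match.
        hits[key] += r["prediction"].strip().lower() == r["answer"].strip().lower()
    return {k: hits[k] / totals[k] for k in totals}

demo = [
    {"scenario": "Driving", "subtask": "counting", "prediction": "3", "answer": "3"},
    {"scenario": "Driving", "subtask": "counting", "prediction": "2", "answer": "3"},
    {"scenario": "Aerial", "subtask": "recognition", "prediction": "Ship", "answer": "ship"},
]
scores = subtask_accuracy(demo)
```

Grouping by (scenario, sub-task) is what makes the diagnosis of specific weaknesses possible, rather than a single headline accuracy.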
What carries the argument
SOUBench, a benchmark constructed via an automatic visual question-answer generation strategy that produces targeted VQA pairs to probe small object comprehension.
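A plausible reading of that strategy is rule-based: filter detection annotations down to instances at or below a small-object area cutoff (the paper's appendix defines small objects as area ≤ 1024 pixels, the COCO-style 32 × 32 threshold), then fill question templates from the surviving boxes. A minimal counting-style sketch; the function name and record schema are hypothetical:

```python
from collections import defaultdict

def generate_small_object_vqa(annotations, area_threshold=1024):
    """Turn detection annotations into counting-style VQA pairs.

    annotations: dicts with 'image_id', 'category', and 'bbox'
    (x, y, w, h). An instance counts as small when w * h is at most
    area_threshold, the COCO-style 32x32-pixel cutoff.
    """
    counts = defaultdict(int)
    for ann in annotations:
        _, _, w, h = ann["bbox"]
        if w * h <= area_threshold:
            counts[(ann["image_id"], ann["category"])] += 1
    return [
        {
            "image_id": image_id,
            "question": f"How many small {category} instances are present in this image?",
            "answer": str(n),
        }
        for (image_id, category), n in sorted(counts.items())
    ]

demo = [
    {"image_id": 1, "category": "vehicle", "bbox": (10, 10, 20, 20)},  # 400 px: small
    {"image_id": 1, "category": "vehicle", "bbox": (50, 50, 30, 30)},  # 900 px: small
    {"image_id": 1, "category": "person", "bbox": (0, 0, 64, 64)},     # 4096 px: not small
]
pairs = generate_small_object_vqa(demo)
```

Because answers come directly from ground-truth boxes, the pipeline scales to tens of thousands of pairs, which is exactly why its validity (rather than its volume) is the load-bearing question below.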
If this is right
- Fine-tuning on SOU-Train produces measurable gains in accuracy on small object VQA tasks.
- The six sub-tasks allow diagnosis of specific weaknesses such as localization or reasoning about tiny elements.
- Models trained with these datasets become more usable in driving, aerial, and underwater applications.
- Future MLLM design should incorporate methods to preserve fine visual details during processing.
- SOU-VQA serves as a reusable standard for tracking progress on small object capabilities.
Where Pith is reading between the lines
- Limitations in small object understanding could reduce reliability of MLLMs in safety-critical settings like autonomous navigation.
- The benchmark approach could be adapted to test other fine-grained visual skills beyond small objects.
- Architectural additions such as region-focused attention layers might complement the data-driven improvements shown here.
- Direct comparison of these models against classical small-object detectors would clarify whether language-model components add value or introduce new failure modes.
Load-bearing premise
The automatic visual question-answer generation strategy produces VQA pairs that measure small object understanding validly and without bias.
What would settle it
Either human verification showing that the generated questions fail to isolate small object understanding, or the 15 models scoring substantially higher on an independent, manually authored set of small-object questions.
Original abstract
Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, they remain blank and unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first and comprehensive benchmark for exploring the small objects understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervising fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance the latest MLLM's ability to understand small objects. Comprehensive experimental results demonstrate that, the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SOUBench as the first benchmark for Small Object Understanding (SOU) in Multimodal Large Language Models (MLLMs). It proposes an automatic visual question-answer generation strategy to construct the SOU-VQA dataset containing 18,204 VQA pairs across six sub-tasks in three scenarios (Driving, Aerial, Underwater). The authors evaluate 15 state-of-the-art MLLMs and conclude they exhibit weak SOU capabilities. They further release SOU-Train (11,226 VQA pairs) and show that supervised fine-tuning improves performance on a recent MLLM, positioning the resources as an empirical foundation for future work.
Significance. If the SOU-VQA pairs validly isolate small-object localization and reasoning without generation artifacts, the work would usefully document a gap in current MLLMs for practically relevant small-object scenarios and demonstrate that targeted fine-tuning can address it. The public release of SOUBench, the two datasets, and code is a concrete strength that enables reproducibility and follow-on research.
Major comments (1)
- [Dataset construction] Dataset construction section: the claim that the automatic VQA generation strategy is 'effective' is not supported by any reported validation (human review statistics, error analysis, checks for unanswerable questions, or bias audits). Because the headline finding of weak SOU capabilities across all 15 MLLMs rests exclusively on performance measured against these 18,204 pairs, the absence of such validation is load-bearing for the central empirical claim.
Minor comments (2)
- [Abstract] The abstract states there are 'six relevant sub-tasks' but does not enumerate them; the main text should list and briefly define them near the first mention of SOU-VQA.
- [Fine-tuning experiments] The fine-tuning experiments report improvement but omit the precise baseline numbers, training hyperparameters, and any ablation isolating the contribution of SOU-Train versus other factors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work introducing SOUBench and the associated datasets for small object understanding in MLLMs. We address the major comment on dataset validation below.
Point-by-point responses
Referee: [Dataset construction] Dataset construction section: the claim that the automatic VQA generation strategy is 'effective' is not supported by any reported validation (human review statistics, error analysis, checks for unanswerable questions, or bias audits). Because the headline finding of weak SOU capabilities across all 15 MLLMs rests exclusively on performance measured against these 18,204 pairs, the absence of such validation is load-bearing for the central empirical claim.
Authors: We agree that the manuscript does not currently report human validation statistics, error analysis, checks for unanswerable questions, or bias audits for the automatically generated SOU-VQA pairs, and that this represents a gap given the central role of the benchmark in our empirical claims. The generation strategy relies on rule-based extraction from existing annotations in the source datasets (e.g., bounding-box filtering for small objects combined with template-based question formulation), which we intended to make reproducible and scalable. In the revised manuscript we will add a new subsection under Dataset Construction that includes: (i) results from human review of a stratified random sample of 1,000 VQA pairs by three independent annotators, reporting inter-annotator agreement (Cohen's kappa) and the fraction of pairs judged correct, answerable, and free of obvious generation artifacts; (ii) a categorized error analysis of any rejected pairs; and (iii) summary statistics on question answerability and scenario/sub-task balance to address potential biases. These additions will directly substantiate the effectiveness claim and allow readers to assess the benchmark's validity. revision: yes
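For reference, the inter-annotator agreement statistic the authors promise is straightforward to compute. A minimal sketch of two-rater Cohen's kappa on binary validity judgments; the sample labels are illustrative only, not the paper's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from the raters'
    marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators judging ten sampled VQA pairs as valid (1) or invalid (0).
a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
kappa = cohens_kappa(a, b)
```

With three annotators, as proposed, one would report either pairwise kappas or a multi-rater generalization such as Fleiss' kappa.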
Circularity Check
No circularity: purely empirical benchmark creation and evaluation
Full rationale
The paper constructs SOUBench via an automatic VQA generation pipeline, releases SOU-VQA (18,204 pairs) and SOU-Train (11,226 pairs), evaluates 15 MLLMs, and reports performance numbers. No equations, fitted parameters, predictions derived from prior results, uniqueness theorems, or self-citation chains appear in the derivation. All central claims are direct empirical measurements on the newly created datasets; the work is self-contained against external benchmarks and contains no load-bearing steps that reduce to their own inputs by construction.