pith. machine review for the scientific record.

arxiv: 2604.22884 · v1 · submitted 2026-04-24 · 💻 cs.CV · cs.AI

Recognition: unknown

Can Multimodal Large Language Models Truly Understand Small Objects?

Authors on Pith no claims yet

Pith reviewed 2026-05-08 12:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords small object understanding · multimodal large language models · visual question answering · benchmark · SOUBench · SOU-VQA · SOU-Train · fine-tuning

The pith

Multimodal large language models exhibit weak capabilities in understanding small objects within images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SOUBench as the first benchmark to assess how multimodal large language models handle small objects. It builds the SOU-VQA dataset of 18,204 question-answer pairs using an automatic generation method, spanning six sub-tasks in driving, aerial, and underwater scenarios. Evaluation of 15 current models shows consistent weaknesses across these tasks. The authors also release SOU-Train, a set of 11,226 pairs, and demonstrate that fine-tuning the latest model on it improves small object performance. The work supplies datasets and a testing framework to guide further progress in this area.

Core claim

The authors establish that state-of-the-art multimodal large language models display limited ability to understand small objects. They demonstrate this through SOUBench by testing 15 models on the SOU-VQA dataset, which covers multiple real-world contexts. They further show that supervised fine-tuning on the companion SOU-Train dataset measurably strengthens these capabilities.
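
As an editorial illustration of how that head-to-head evaluation could be scored, the sketch below computes per-scenario, per-sub-task accuracy on multiple-choice pairs. It assumes SOU-VQA-style records carry `image`, `question`, `options`, `answer`, `scenario`, and `sub_task` fields and that `query_model` wraps whatever MLLM is under test; the field names and the helper are assumptions for illustration, not the released data format or harness.

```python
import json
from collections import defaultdict

def score_sou_vqa(pairs_path: str, query_model) -> dict:
    """Accuracy per (scenario, sub_task) on multiple-choice SOU-VQA-style pairs.

    `query_model(image_path, question, options) -> str` is a hypothetical
    wrapper around the MLLM under test; it should return one option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(pairs_path) as f:
        pairs = json.load(f)  # assumed: a list of dicts, one per VQA pair
    for p in pairs:
        key = (p["scenario"], p["sub_task"])
        pred = query_model(p["image"], p["question"], p["options"]).strip().upper()
        total[key] += 1
        if pred == p["answer"].strip().upper():
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}
```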

What carries the argument

SOUBench, a benchmark constructed via an automatic visual question-answer generation strategy that produces targeted VQA pairs to probe small object comprehension.
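
The generation strategy is only summarized here, so the following is a minimal sketch of one plausible template-based step: an object-counting pair built from COCO-style bounding-box annotations, using the small-object criterion (absolute area ≤ 1024 pixels) quoted in the paper's appendix material. The record layout, four-option format, and distractor rule are illustrative assumptions, not the released pipeline.

```python
import random

SMALL_OBJECT_MAX_AREA = 1024  # pixels; criterion quoted from the paper's appendix

def counting_qa_pair(annotations, category, rng=random):
    """Build one multiple-choice object-counting pair from COCO-style boxes.

    `annotations` is assumed to be a list of dicts with `category` and
    `bbox` = (x, y, w, h). Distractors are small perturbations of the count.
    """
    count = sum(
        1 for a in annotations
        if a["category"] == category
        and a["bbox"][2] * a["bbox"][3] <= SMALL_OBJECT_MAX_AREA
    )
    candidates = [c for c in (count - 2, count - 1, count + 1, count + 2, count + 3)
                  if c >= 0 and c != count]
    distractors = rng.sample(candidates, 3)
    options = distractors + [count]
    rng.shuffle(options)
    letters = ["A", "B", "C", "D"]
    return {
        "question": f"How many small {category} objects are present in this image?",
        "options": dict(zip(letters, map(str, options))),
        "answer": letters[options.index(count)],
    }
```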

If this is right

  • Fine-tuning on SOU-Train produces measurable gains in accuracy on small object VQA tasks.
  • The six sub-tasks allow diagnosis of specific weaknesses such as localization or reasoning about tiny elements.
  • Models trained with these datasets become more usable in driving, aerial, and underwater applications.
  • Future MLLM design should incorporate methods to preserve fine visual details during processing.
  • SOU-VQA serves as a reusable standard for tracking progress on small object capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Limitations in small object understanding could reduce reliability of MLLMs in safety-critical settings like autonomous navigation.
  • The benchmark approach could be adapted to test other fine-grained visual skills beyond small objects.
  • Architectural additions such as region-focused attention layers might complement the data-driven improvements shown here.
  • Direct comparison of these models against classical small-object detectors would clarify whether language-model components add value or introduce new failure modes.

Load-bearing premise

The automatic visual question-answer generation strategy produces VQA pairs that measure small object understanding capabilities validly and without bias.
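
If that premise holds, cheap sanity checks on the generated pairs should pass. One such check is sketched below: whether the correct answer letter is uniformly distributed across options, since positional skew lets models score above chance without reading the image. It follows the illustrative record layout used above and is not a check the paper reports.

```python
from collections import Counter
from scipy.stats import chisquare

def answer_position_bias(pairs):
    """Chi-square test of whether correct-answer letters (A-D) are uniform.

    A small p-value flags positional bias that an automatic option generator
    can easily introduce. `pairs` is a list of dicts with an `answer` letter.
    """
    counts = Counter(p["answer"] for p in pairs)
    observed = [counts.get(letter, 0) for letter in "ABCD"]
    stat, p_value = chisquare(observed)  # default expectation: uniform
    return {"counts": dict(zip("ABCD", observed)), "chi2": stat, "p": p_value}
```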

What would settle it

Human verification of the generated questions showing they fail to isolate small object understanding, or the 15 models scoring much higher on an independent set of manually created small-object questions.

Figures

Figures reproduced from arXiv: 2604.22884 by Fujun Han, Jingqi Ye, Junan Chen, Peng Ye, Tao Chen, Xintong Zhu, Xuanjie Mao.

Figure 1
Figure 1: Performance and comparison on different small object understanding tasks. We can find that, in several common small object understanding sub-tasks, the promising MLLMs provide incorrect answers. These phenomena indicate that the current state-of-the-art Multimodal Large Language Models still exhibit weak small object understanding capabilities. view at source ↗
Figure 2
Figure 2: The performance and comparison of 15 dominant MLLMs on the proposed small object understanding (SOU) task. We can find that the best performance is 58.97%, achieved by GPT-5.2. However, compared to human performance (82.50%), even the best GPT-5.2 still exhibits a significant gap, which indicates that current MLLMs have weak capabilities in the given SOU tasks. view at source ↗
Figure 3
Figure 3: The pipeline of the automated generation strategy for Visual Question Answering (VQA). Our generation strategy primarily consists of six steps, i.e., Dataset Input, Preprocessing, Annotation Extraction, QA Generation, Option Generation, and QA Output. In the QA Generation and Option Generation stages, we design automated procedures to obtain VQA pairs for six sub-tasks. view at source ↗
Figure 4
Figure 4: An overview of the proposed SOU-VQA. Our VQA consists of three dominant small object scenarios: Driving, Aerial, and Underwater. In each scenario, we design six distinct sub-tasks to comprehensively evaluate MLLMs' understanding of small objects, i.e., foundational Perception sub-tasks (Category Enumeration, Object Counting) and Spatial Reasoning sub-tasks (Category Recognition, Object Location, …). view at source ↗
Figure 5
Figure 5: The detailed dataset statistics. Left: data statistics. Right: dataset sources. view at source ↗
Figure 6
Figure 6: The performance of five different model scales. Driving, Aerial, and Underwater denote the three different scenarios of the proposed SOU-VQA dataset. Note that the performance of Aerial is the overall performance of A-VisDrone, A-AITOD, and A-SODA, as shown in the three subplots on the right. view at source ↗
Figure 7
Figure 7: The performance impact of different training costs. view at source ↗
Figure 8
Figure 8: The detailed principles of the proposed automated generation strategy of visual question answering. The principle mainly consists of three modules: Sub-task Question Template Design, Correct Answer Generation, and Incorrect Options Generation. Yellow, lime, and pink denote the Driving, Aerial, and Underwater scenarios, respectively. view at source ↗
Figure 9
Figure 9: The more detailed results of different model scales. Here, we report more results in different scenarios and sub-datasets. We reveal the shortcomings of current MLLMs across different sub-tasks and suggest some directions for future improvement, e.g., in the Aerial (A-VisDrone) scenario, as the model scale increases, the performance of the model in POI (Peripheral Object Identification) decreases. view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, they remain blank and unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first and comprehensive benchmark for exploring the small objects understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervising fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance the latest MLLM's ability to understand small objects. Comprehensive experimental results demonstrate that, the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SOUBench as the first benchmark for Small Object Understanding (SOU) in Multimodal Large Language Models (MLLMs). It proposes an automatic visual question-answer generation strategy to construct the SOU-VQA dataset containing 18,204 VQA pairs across six sub-tasks in three scenarios (Driving, Aerial, Underwater). The authors evaluate 15 state-of-the-art MLLMs and conclude they exhibit weak SOU capabilities. They further release SOU-Train (11,226 VQA pairs) and show that supervised fine-tuning improves performance on a recent MLLM, positioning the resources as an empirical foundation for future work.

Significance. If the SOU-VQA pairs validly isolate small-object localization and reasoning without generation artifacts, the work would usefully document a gap in current MLLMs for practically relevant small-object scenarios and demonstrate that targeted fine-tuning can address it. The public release of SOUBench, the two datasets, and code is a concrete strength that enables reproducibility and follow-on research.

major comments (1)
  1. [Dataset construction] Dataset construction section: the claim that the automatic VQA generation strategy is 'effective' is not supported by any reported validation (human review statistics, error analysis, checks for unanswerable questions, or bias audits). Because the headline finding of weak SOU capabilities across all 15 MLLMs rests exclusively on performance measured against these 18,204 pairs, the absence of such validation is load-bearing for the central empirical claim.
minor comments (2)
  1. [Abstract] The abstract states there are 'six relevant sub-tasks' but does not enumerate them; the main text should list and briefly define them near the first mention of SOU-VQA.
  2. [Fine-tuning experiments] The fine-tuning experiments report improvement but omit the precise baseline numbers, training hyperparameters, and any ablation isolating the contribution of SOU-Train versus other factors.
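
To make minor comment 2 concrete, the level of detail that would resolve it looks roughly like the illustrative LoRA-style SFT configuration below. The paper cites LoRA but reports none of these settings here, so every value is a placeholder assumption, not the authors' recipe.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Illustrative placeholders only -- none of these values come from the paper.
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="sou_train_sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)
```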

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work introducing SOUBench and the associated datasets for small object understanding in MLLMs. We address the major comment on dataset validation below.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the claim that the automatic VQA generation strategy is 'effective' is not supported by any reported validation (human review statistics, error analysis, checks for unanswerable questions, or bias audits). Because the headline finding of weak SOU capabilities across all 15 MLLMs rests exclusively on performance measured against these 18,204 pairs, the absence of such validation is load-bearing for the central empirical claim.

    Authors: We agree that the manuscript does not currently report human validation statistics, error analysis, checks for unanswerable questions, or bias audits for the automatically generated SOU-VQA pairs, and that this represents a gap given the central role of the benchmark in our empirical claims. The generation strategy relies on rule-based extraction from existing annotations in the source datasets (e.g., bounding-box filtering for small objects combined with template-based question formulation), which we intended to make reproducible and scalable. In the revised manuscript we will add a new subsection under Dataset Construction that includes: (i) results from human review of a stratified random sample of 1,000 VQA pairs by three independent annotators, reporting inter-annotator agreement (Cohen's kappa) and the fraction of pairs judged correct, answerable, and free of obvious generation artifacts; (ii) a categorized error analysis of any rejected pairs; and (iii) summary statistics on question answerability and scenario/sub-task balance to address potential biases. These additions will directly substantiate the effectiveness claim and allow readers to assess the benchmark's validity. revision: yes
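
A minimal sketch of the agreement statistic the rebuttal promises, assuming each of the three annotators marks every sampled pair as valid (1) or invalid (0); averaging pairwise Cohen's kappa is one common convention, and the revised paper may report it differently.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def sample_validation_summary(labels):
    """Summarize a human audit of generated VQA pairs.

    `labels` maps annotator name -> list of 0/1 judgments (1 = pair is correct
    and answerable), with all lists aligned to the same sampled pairs.
    """
    kappas = {
        (a, b): cohen_kappa_score(labels[a], labels[b])
        for a, b in combinations(sorted(labels), 2)
    }
    n = len(next(iter(labels.values())))
    # Fraction of pairs that every annotator accepted.
    unanimous = sum(all(labels[a][i] for a in labels) for i in range(n)) / n
    return {
        "pairwise_kappa": kappas,
        "mean_kappa": sum(kappas.values()) / len(kappas),
        "unanimous_valid_rate": unanimous,
    }
```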

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark creation and evaluation

full rationale

The paper constructs SOUBench via an automatic VQA generation pipeline, releases SOU-VQA (18,204 pairs) and SOU-Train (11,226 pairs), evaluates 15 MLLMs, and reports performance numbers. No equations, fitted parameters, predictions derived from prior results, uniqueness theorems, or self-citation chains appear in the derivation. All central claims are direct empirical measurements on the newly created datasets; the work is self-contained against external benchmarks and contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper; no free parameters, axioms, or invented entities are introduced in the central claim.

pith-pipeline@v0.9.0 · 5573 in / 1127 out tokens · 32689 ms · 2026-05-08T12:45:32.867230+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 23 canonical work pages · 9 internal anchors

  1. [1]

    Choice: Benchmarking the remote sensing capabilities of large vision-language models. arXiv preprint arXiv:2411.18145 (2024)

    An, X., Sun, J., Gui, Z., He, W.: Choice: benchmarking the remote sensing capabilities of large vision-language models. arXiv preprint arXiv:2411.18145 (2024)

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  3. [3]

    Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems 37, 27056–27087 (2024)

  4. [4]

    arXiv preprint arXiv:2004.12432 (2020)

    Chen, Y., Zhang, P., Li, Z., Li, Y., Zhang, X., Qi, L., Sun, J., Jia, J.: Dynamic scale training for object detection. arXiv preprint arXiv:2004.12432 (2020)

  5. [5]

    Cheng, G., Yuan, X., Yao, X., Yan, K., Zeng, Q., Xie, X., Han, J.: Towards large-scale small object detection: Survey and benchmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(11), 13467–13488 (2023)

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  7. [7]

    DeepMind, G.: Gemini 3 flash: frontier intelligence built for speed (2025),https: //blog.google/products-and-platforms/products/gemini/gemini-3-flash/

  8. [8]

    arXiv preprint arXiv:2509.18189 (2025)

    Dong, D., Zheng, M., Xu, D., Zhuang, B., Zhang, W., Luo, C., Wang, H., Zhao, Z., Li, J., Li, Y., et al.: Qianfan-vl: Domain-enhanced universal vision-language models. arXiv preprint arXiv:2509.18189 (2025)

  9. [9]

    In: Proceedings of the IEEE/CVF international conference on computer vision workshops

    Du, D., Zhu, P., Wen, L., Bian, X., Lin, H., Hu, Q., Peng, T., Zheng, J., Wang, X., Zhang, Y., et al.: Visdrone-det2019: The vision meets drone object detection in image challenge results. In: Proceedings of the IEEE/CVF international conference on computer vision workshops. pp. 0–0 (2019)

  10. [10]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11198–11201 (2024)

  11. [11]

    Neurocomputing 517, 243–256 (2023)

    Fu, C., Liu, R., Fan, X., Chen, P., Fu, H., Yuan, W., Zhu, M., Luo, Z.: Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing 517, 243–256 (2023)

  12. [12]

    arXiv preprint arXiv:2507.21649 (2025)

    Gao, S., Yang, P., Guo, H., Liu, Y., Chen, Y., Li, S., Zhu, H., Xu, J., Zhang, X.Y., Huang, L.: The evolution of video anomaly detection: A unified framework from dnn to mllm. arXiv preprint arXiv:2507.21649 (2025)

  13. [13]

    In: The Fourteenth International Conference on Learning Representations

    Han, F., Ye, J., Zhang, C., Ye, P.: Ovid: Open-vocabulary intrusion detection. In: The Fourteenth International Conference on Learning Representations

  14. [14]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems

    Han, F., Ye, P.: Mllm-isu: The first-ever comprehensive benchmark for multimodal large language models based intrusion scene understanding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems

  15. [15]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Han, F., Ye, P., Duan, S., Wang, L.: Ada-id: Active domain adaptation for intrusion detection. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 7404–7413 (2024)

  16. [16]

    IEEE Transactions on Automation Science and Engineering 22, 3582–3597 (2024)

    Han, F., Ye, P., Li, K., Duan, S., Wang, L.: Mf-id: a benchmark and approach for multi-category fine-grained intrusion detection. IEEE Transactions on Automation Science and Engineering 22, 3582–3597 (2024)

  17. [17]

    Coralvqa: A large-scale visual question answering dataset for coral reef image understanding

    Han, H., Wang, W., Zhang, G., Li, M., Wang, Y.: Coralvqa: A large-scale visual question answering dataset for coral reef image understanding. arXiv preprint arXiv:2507.10449 (2025)

  18. [18]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021)

  19. [19]

    In: International Conference on Learning Representations (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  20. [20]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, J., Xie, C., Ao, J., Leng, D., Yin, Y.: Lmm-det: Make large multimodal models excel in object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 308–318 (2025)

  21. [21]

    IEEE Transactions on Geoscience and Remote Sensing (2025)

    Li, K., Wang, Y., Han, F., Wang, H., Xiong, Z., Tian, Y.: Hstnet: A hybrid spatial-channel sparse transformer network for infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing (2025)

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)

  23. [23]

    Advances in Neural Information Processing Systems 37, 3229–3242 (2024)

    Li, X., Ding, J., Elhoseiny, M.: Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. Advances in Neural Information Processing Systems 37, 3229–3242 (2024)

  24. [24]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, Y., Tian, M., Lin, Z., Zhu, J., Zhu, D., Liu, H., Zhang, Y., Xiong, Z., Zhao, X.: Fine-grained evaluation of large vision-language models in autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9431–9442 (2025)

  25. [25]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

  26. [26]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)

  27. [27]

    Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Sun, Y., Deng, C., Xu, H., Xie, Z., Ruan, C.: Deepseek-vl: Towards real-world vision-language understanding (2024)

  28. [28]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Ma, W., Chen, H., Zhang, G., Chou, Y.C., Chen, J., de Melo, C., Yuille, A.: 3dsr- bench: A comprehensive 3d spatial reasoning benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6924–6934 (2025)

  29. [29]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244 (2022)

    Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244 (2022)

  30. [30]

    In: European Conference on Computer Vision

    Muhtar, D., Li, Z., Gu, F., Zhang, X., Xiao, P.: Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In: European Conference on Computer Vision. pp. 440–457. Springer (2024)

  31. [31]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

    Ning, M., Zhu, B., Xie, Y., Lin, B., Cui, J., Yuan, L., Chen, D., Yuan, L.: Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103 (2023)

  32. [32]

    OpenAI: Introducing gpt-5.2 (2025), https://openai.com/index/introducing-gpt-5-2/

  33. [33]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Park, S.Y., Cui, C., Ma, Y., Moradipari, A., Gupta, R., Han, K., Wang, Z.: Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8066–8076 (2025)

  34. [34]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)

  35. [35]

    Gemma 3 Technical Report

    Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)

  36. [36]

    Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., Wang, C., Zhang, D., Du, D., Wang, D., Yuan, E., Lu, E., Li, F., Sung, F., Wei, G., Lai, G., Zhu, H., Ding, H., Hu, H., Yang, H., Zhang, H., Wu, H., Yao, H., Lu, H., Wang, H., Gao, H., Zheng, H., Li, J., Su, J., Wang, J., Deng, J., Qiu, J., Xie, J., Wang, J., Liu,...

  37. [37]

    arXiv preprint arXiv:2503.14935 (2025)

    Tu, C., Zhang, L., Chen, P., Ye, P., Zeng, X., Cheng, W., Yu, G., Chen, T.: Favor-bench: A comprehensive benchmark for fine-grained video motion understanding. arXiv preprint arXiv:2503.14935 (2025)

  38. [38]

    Wang, F., Wang, H., Guo, Z., Wang, D., Wang, Y., Chen, M., Ma, Q., Lan, L., Yang, W., Zhang, J., et al.: Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14325–14336 (2025)

  39. [39]

    In: 2020 25th international conference on pattern recognition (ICPR)

    Wang, J., Yang, W., Guo, H., Zhang, R., Xia, G.S.: Tiny object detection in aerial images. In: 2020 25th international conference on pattern recognition (ICPR). pp. 3791–3798. IEEE (2021)

  40. [40]

    Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response. arXiv preprint arXiv:2505.21089 (2025)

    Wang, J., Xuan, W., Qi, H., Liu, Z., Liu, K., Wu, Y., Chen, H., Song, J., Xia, J., Zheng, Z., et al.: Disasterm3: A remote sensing vision-language dataset for disaster damage assessment and response. arXiv preprint arXiv:2505.21089 (2025)

  41. [41]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  42. [42]

    arXiv preprint arXiv:2401.07529 (2024)

    Wang, Y., Liao, Y., Liu, H., Liu, H., Wang, Y., Wang, Y.: Mm-sap: A comprehensive benchmark for assessing self-awareness of multimodal large language models in perception. arXiv preprint arXiv:2401.07529 (2024)

  43. [43]

    IEEE Transactions on Image Processing 32, 364–376 (2022)

    Wu, X., Hong, D., Chanussot, J.: Uiu-net: U-net in u-net for infrared small object detection. IEEE Transactions on Image Processing 32, 364–376 (2022)

  44. [44]

    Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., Xie, Z., Wu, Y., Hu, K., Wang, J., Sun, Y., Li, Y., Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding (2024),https://a...

  45. [45]

    xAI: Grok 4.1 (2025), https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf

  46. [46]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Xie, S., Kong, L., Dong, Y., Sima, C., Zhang, W., Chen, Q.A., Liu, Z., Pan, L.: Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6585–6597 (2025)

  47. [47]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  48. [48]

    Advances in Neural Information Processing Systems 37, 94327–94427 (2024)

    Ye, J., Wang, G., Li, Y., Deng, Z., Li, W., Li, T., Duan, H., Huang, Z., Su, Y., Wang, B., et al.: Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai. Advances in Neural Information Processing Systems 37, 94327–94427 (2024)

  49. [49]

    IEEE Transactions on Circuits and Systems for Video Technology (2025)

    Ye, P., Huang, C., Shen, M., Chen, T., Huang, Y., Ouyang, W.: Dynamic model merging with mixture of weights. IEEE Transactions on Circuits and Systems for Video Technology (2025)

  50. [50]

    Yu, F., Wan, H., Cheng, Q., Zhang, Y., Chen, J., Han, F., Wu, Y., Yao, J., Hu, R., Ding, N., et al.: Hipho: How far are (m)llms from humans in the latest high school physics olympiad benchmark? arXiv preprint arXiv:2509.07894 (2025)

  51. [51]

    Yu, T., Wang, Z., Wang, C., Huang, F., Ma, W., He, Z., Cai, T., Chen, W., Huang, Y., Zhao, Y., Xu, B., Cui, J., Xu, Y., Ruan, L., Zhang, L., Liu, H., Tang, J., Liu, H., Guo, Q., Hu, W., He, B., Zhou, J., Cai, J., Qi, J., Guo, Z., Chen, C., Zeng, G., Li, Y., Cui, G., Ding, N., Han, X., Yao, Y., Liu, Z., Sun, M.: Minicpm-v 4.5: Cooking efficient mllms via ...

  52. [52]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556– 9567 (2024)

  53. [53]

    arXiv preprint arXiv:2510.18262 (2025)

    Zhang, D., Rong, C., Li, B., Wang, F., Zhao, Z., Gao, J., Li, X.: Uwbench: A comprehensive vision-language benchmark for underwater understanding. arXiv preprint arXiv:2510.18262 (2025)

  54. [54]

    How many small objects are present in this image?

    Zhang, W., Cai, M., Zhang, T., Zhuang, Y., Mao, X.: Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing 62, 1–20 (2024)

  55. [55]

    Small Object Definition: An instance is classified as a small object if its absolute area is less than or equal to 1024 pixels

  56. [56]

    Predefined Small Object Category List: [‘people’, ‘rider’, ‘bicycle’, ‘motor’, ‘vehicle’, ‘traffic-sign’, ‘traffic-light’, ‘traffic-camera’, ‘warning-cone’]. ← If test ‘Driving’ scenario, please notice it!

  57. [57]

    Predefined Small Object Category List: [‘airplane’, ‘helicopter’, ‘small-vehicle’, ‘large-vehicle’, ‘ship’, ‘container’, ‘storage-tank’, ‘swimming-pool’, ‘windmill’]. ← If test ‘Aerial (SODA)’ scenario, please notice it!

  58. [58]

    Predefined Small Object Category List: [‘airplane’, ‘bridge’, ‘storage-tank’, ‘ship’, ‘swimming-pool’, ‘vehicle’, ‘person’, ‘wind-mill’]. ← If test ‘Aerial (AITOD)’ scenario, please notice it!

  59. [59]

    Predefined Small Object Category List: [‘pedestrian’, ‘person’, ‘bicycle’, ‘car’, ‘van’, ‘truck’, ‘tricycle’, ‘awning-tricycle’, ‘bus’, ‘motor’]. ← If test ‘Aerial (VisDrone)’ scenario, please notice it!

  60. [60]

    This performance gap reveals that our current MLLMs still have a long way to go in achieving truly small object understanding

    Predefined Small Object Category List: [‘holothurian’, ‘echinus’, ‘scallop’, ‘starfish’, ‘fish’, ‘corals’, ‘diver’, ‘cuttlefish’, ‘turtle’, ‘jellyfish’]. ← If test ‘Underwater’ scenario, please notice it! … object recognition capabilities in Aerial scenarios, with the gap reaching as high as 21.88%. This performance gap reveals that our current MLLMs still have a lo…