arxiv: 2604.17889 · v1 · submitted 2026-04-20 · 💻 cs.CV

AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning

Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

keywords aerial visual question answering · scene graphs · retrieval-augmented generation · multimodal large language models · structured visual knowledge · fine-grained reasoning · object relations

The pith

Scene graphs and targeted retrieval create better prompts than dense visual tokens for aerial visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that aerial visual question answering suffers when models reason directly over dense visual tokens because key evidence often resides in small objects, explicit counts, coarse positions, and object relations. AeroRAG addresses this by first turning the image into a scene graph that records categories, quantities, locations, and semantic links, then retrieving only the query-relevant portions to build compact text prompts for a language model. This explicit intermediate layer is shown to deliver measurable gains over six baseline multimodal models on aerial data and on a general benchmark, with the biggest lifts occurring precisely where dense-token methods struggle. The work also checks that the same interface works on standard visual question answering tasks, suggesting the structured route is not limited to aerial imagery.

Core claim

AeroRAG converts an input image into structured visual knowledge that includes object categories, quantities, spatial locations, and semantic relations, then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. On the AUG aerial dataset and the VG-150 benchmark this produces consistent improvements over six strong MLLM baselines, with the largest gains in dense aerial scenes and relation-sensitive reasoning tasks. The same interface remains compatible when evaluated on VQAv2, indicating that structured retrieval offers a practical alternative to direct dense-token reasoning.

What carries the argument

The scene-graph-guided multimodal retrieval-augmented generation framework, which extracts explicit semantic structures from an image and retrieves only the relevant chunks to form language-model prompts.
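
The extract-then-retrieve flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `SceneObject` and `Relation` types, the chunk templates, and the token-overlap scorer are all hypothetical stand-ins, since the page does not say which scene-graph generator or retriever the paper uses. The default `k=4` mirrors the Top-k value that the Figure 3 caption reports as best.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    category: str
    location: str  # coarse position, e.g. "top-left" (hypothetical encoding)


@dataclass(frozen=True)
class Relation:
    subject_id: int
    predicate: str
    object_id: int


def tokenize(text):
    """Lowercase alphanumeric tokens; an assumed, deliberately naive tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def graph_to_chunks(objects, relations):
    """Flatten a scene graph into text chunks: counts, locations, relations."""
    by_id = {o.obj_id: o for o in objects}
    counts = {}
    for o in objects:
        counts[o.category] = counts.get(o.category, 0) + 1
    chunks = [f"count: {cat} x{n}" for cat, n in sorted(counts.items())]
    chunks += [f"location: {o.category} #{o.obj_id} at {o.location}" for o in objects]
    chunks += [
        f"relation: {by_id[r.subject_id].category} #{r.subject_id} "
        f"{r.predicate} {by_id[r.object_id].category} #{r.object_id}"
        for r in relations
    ]
    return chunks


def retrieve(question, chunks, k=4):
    """Rank chunks by token overlap with the question and keep the top k.

    Token overlap is an assumption standing in for the paper's retriever.
    sorted() is stable, so ties keep their original chunk order.
    """
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=4):
    """Assemble a compact text-only prompt from the retrieved chunks."""
    facts = retrieve(question, chunks, k)
    lines = ["Scene facts:"] + [f"- {c}" for c in facts]
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```

With two cars and one building in the graph, a question such as "Which car is near the building?" ranks the relation chunk first, because it shares the most tokens with the question. The language model then sees only the selected facts rather than the image.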

If this is right

  • Performance improves most on dense aerial scenes and on questions that hinge on object relations or counts.
  • The structured interface remains effective when applied to general-domain visual question answering benchmarks.
  • Replacing dense visual tokens with retrieved semantic chunks yields a more compact and interpretable prompt format.
  • The method supports deployment-oriented systems that need grounded reasoning without full image token transmission.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval step could be adapted to other domains where spatial counts and relations dominate, such as traffic monitoring or warehouse robotics.
  • Because the language model receives only selected semantic chunks, the approach may reduce token usage and latency in real-time aerial applications.
  • Improving the upstream scene-graph extractor would directly strengthen the entire pipeline, since errors at extraction propagate to the retrieved prompt.

Load-bearing premise

Converting an input image into structured visual knowledge via scene graphs accurately captures and preserves task-critical evidence carried by small objects, explicit quantities, coarse locations, and inter-object relations without significant loss.

What would settle it

An experiment on a held-out set of dense aerial images in which scene-graph extraction repeatedly drops small but decision-critical objects, after which the retrieval-augmented model shows no accuracy gain or outright loss compared with the dense-token baselines.
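
One way to read this test is as a stratified comparison: split the held-out examples by whether the decision-critical object survived extraction, then compare accuracy gains in each stratum. The sketch below assumes each example has already been annotated with three boolean flags; the `Example` type and its field names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Example:
    critical_object_extracted: bool  # did the scene graph keep the key object?
    structured_correct: bool         # retrieval-augmented model answered correctly
    baseline_correct: bool           # dense-token baseline answered correctly


def accuracy_gain(examples):
    """Structured accuracy minus baseline accuracy; None for an empty stratum."""
    if not examples:
        return None
    n = len(examples)
    structured = sum(e.structured_correct for e in examples) / n
    baseline = sum(e.baseline_correct for e in examples) / n
    return structured - baseline


def gain_by_extraction(examples):
    """Report the accuracy gain separately for extracted and dropped strata."""
    hit = [e for e in examples if e.critical_object_extracted]
    miss = [e for e in examples if not e.critical_object_extracted]
    return {"extracted": accuracy_gain(hit), "dropped": accuracy_gain(miss)}
```

A gain that is positive in the "extracted" stratum but zero or negative in the "dropped" stratum would be consistent with the load-bearing premise. A real analysis would also need confidence intervals, which this sketch omits.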

Figures

Figures reproduced from arXiv: 2604.17889 by Junxiao Xue, Meicong Si, Quan Deng, Tingqi Hu, Xinyi Yin, Xuecheng Wu, Yunyun Shi.

Figure 1: The qualitative comparison on challenging aerial VQA examples.

Figure 2: Overview of the enhanced multimodal RAG-LLM framework for visual question answering, comprising three core components: (1) Multimodal …

Figure 3: Ablation study on the Top-k retrieval size. The F1-scores across all visual attributes peak at k = 4, demonstrating the optimal balance between providing explicit structural grounding and avoiding visual noise in the text prompt.

Abstract

Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens, our method introduces a more explicit intermediate interface between perception and language reasoning. Experiments on the AUG aerial dataset and the general-domain VG-150 benchmark show consistent improvements over six strong MLLM baselines, with the largest gains observed in dense aerial scenes and relation-sensitive reasoning. We further evaluate the framework on VQAv2 to verify that the proposed interface remains compatible with standard visual reasoning settings. These results suggest that structured retrieval is a practical design direction for deployment-oriented and grounded visual reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for fine-grained visual question answering in aerial scenes. The method first converts an input image into structured visual knowledge (object categories, quantities, spatial locations, and semantic relations) via a scene-graph pipeline, retrieves query-relevant semantic chunks, and constructs compact prompts for a text-based LLM. Experiments on the AUG aerial dataset and the VG-150 benchmark are reported to show consistent improvements over six strong MLLM baselines, with the largest gains in dense aerial scenes and relation-sensitive reasoning; compatibility is further checked on VQAv2.

Significance. If the quantitative results and attribution to the structured interface hold, the work would demonstrate that an explicit scene-graph intermediate representation can better align with task-critical evidence (small objects, quantities, coarse locations, inter-object relations) than dense visual tokens in MLLMs. This offers a practical design direction for grounded, deployment-oriented aerial VQA systems and could influence future multimodal reasoning pipelines that prioritize structured semantics over raw visual embeddings.

major comments (3)
  1. [Abstract] Abstract: the central claim of 'consistent improvements over six strong MLLM baselines' with 'largest gains observed in dense aerial scenes and relation-sensitive reasoning' is stated without any numerical metrics, error bars, statistical tests, or references to tables/figures. This omission makes it impossible to verify whether the data supports the performance claims or the attribution to the structured interface.
  2. [Method] Method description (scene-graph conversion step): the framework's core assumption is that converting the image into object categories, quantities, coarse locations, and semantic relations 'retains task-critical evidence' without material loss. No quantitative evaluation of this extraction step (e.g., precision/recall on small objects or relation F1 scores) is provided on aerial imagery, where standard detectors trained on natural-image distributions may fail due to tiny, densely packed objects and viewpoint ambiguities. If extraction precision is low, downstream retrieval and LLM prompting cannot compensate, undermining the claim that gains arise from the structured interface rather than incidental factors.
  3. [Experiments] Experiments section: the comparison against six MLLM baselines lacks details on whether baselines received equivalent prompt engineering, retrieval augmentation, or scene-graph inputs. Without such controls or ablations isolating the contribution of the scene-graph-guided RAG component, the reported gains on AUG and VG-150 cannot be confidently attributed to the proposed interface.
minor comments (1)
  1. [Abstract] The abstract mentions evaluation on VQAv2 'to verify compatibility' but supplies no results or metrics for this experiment.
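
The extraction-fidelity evaluation requested in major comment 2 could start from a triple-level precision, recall, and F1 computation like the sketch below. It assumes predicted and gold relations are available as `(subject, predicate, object)` label triples and matches them exactly; this is a simplification, since standard scene-graph evaluation usually also requires the predicted boxes to overlap the ground-truth boxes.

```python
def triple_prf(predicted, gold):
    """Precision, recall, and F1 over exact-match relation triples.

    Both arguments are iterables of hashable (subject, predicate, object)
    tuples. Box localization is ignored here, which is an assumption.
    """
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Running the same function on a subset restricted to small objects would give the per-size breakdown the referee asks about, provided object sizes are annotated.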

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and support for our claims.

  1. Referee: [Abstract] Abstract: the central claim of 'consistent improvements over six strong MLLM baselines' with 'largest gains observed in dense aerial scenes and relation-sensitive reasoning' is stated without any numerical metrics, error bars, statistical tests, or references to tables/figures. This omission makes it impossible to verify whether the data supports the performance claims or the attribution to the structured interface.

    Authors: We agree that the abstract should include supporting quantitative details. In the revision we will add specific metrics (e.g., average accuracy improvements on AUG and VG-150) and explicit references to the relevant tables and figures that report the results and subset analyses. Revision: yes.

  2. Referee: [Method] Method description (scene-graph conversion step): the framework's core assumption is that converting the image into object categories, quantities, coarse locations, and semantic relations 'retains task-critical evidence' without material loss. No quantitative evaluation of this extraction step (e.g., precision/recall on small objects or relation F1 scores) is provided on aerial imagery, where standard detectors trained on natural-image distributions may fail due to tiny, densely packed objects and viewpoint ambiguities. If extraction precision is low, downstream retrieval and LLM prompting cannot compensate, undermining the claim that gains arise from the structured interface rather than incidental factors.

    Authors: We acknowledge that a direct quantitative assessment of the scene-graph extraction would strengthen attribution. We will add a new evaluation subsection reporting object detection precision/recall and relation F1 scores on aerial imagery to quantify fidelity of the structured representation. Revision: yes.

  3. Referee: [Experiments] Experiments section: the comparison against six MLLM baselines lacks details on whether baselines received equivalent prompt engineering, retrieval augmentation, or scene-graph inputs. Without such controls or ablations isolating the contribution of the scene-graph-guided RAG component, the reported gains on AUG and VG-150 cannot be confidently attributed to the proposed interface.

    Authors: We will revise the experiments section to explicitly document that baselines were evaluated in their standard configurations without retrieval or scene-graph augmentation. We will also add ablation results that isolate the scene-graph-guided RAG component to better attribute the observed gains. Revision: yes.

Circularity Check

0 steps flagged

No circularity; empirical framework evaluated on external benchmarks

The paper presents AeroRAG as an empirical method that converts aerial images to scene graphs (object categories, quantities, locations, relations) then retrieves semantic chunks to build LLM prompts. Performance is measured via experiments on the external AUG aerial dataset, VG-150, and VQAv2 against six MLLM baselines. No equations, derivations, or first-principles claims appear that reduce to the inputs by construction. No fitted parameters are relabeled as predictions, no self-citation chains justify uniqueness or ansatzes, and the central interface claim is tested rather than assumed tautologically. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that scene graphs represent aerial scene semantics faithfully enough for downstream retrieval and LLM prompting.

axioms (1)
  • domain assumption: Scene graphs can accurately encode task-critical evidence including small objects, quantities, locations, and inter-object relations in aerial imagery.
    Invoked in the description of converting input images to structured visual knowledge.
invented entities (1)
  • AeroRAG framework (no independent evidence)
    purpose: To create an explicit intermediate structured interface between perception and language reasoning for aerial VQA.
    Newly proposed system whose effectiveness is asserted via claimed experiments.

pith-pipeline@v0.9.0 · 5538 in / 1304 out tokens · 46902 ms · 2026-05-10T04:42:37.612959+00:00 · methodology
