pith. machine review for the scientific record.

arxiv: 2404.16821 · v2 · submitted 2024-04-25 · 💻 cs.CV

Recognition: 1 theorem link

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language model · InternVL · vision encoder · dynamic high-resolution · bilingual dataset · OCR · benchmarks · open-source

The pith

InternVL 1.5 reaches state-of-the-art on 8 of 18 multimodal benchmarks by strengthening its vision encoder, adding dynamic high-resolution tiling, and using a new bilingual dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternVL 1.5 as an open-source multimodal large language model meant to close the gap with proprietary systems such as GPT-4V. It makes three changes: continuous training of the InternViT-6B vision encoder so it is stronger and reusable across LLMs; dynamic division of images into 1 to 40 tiles of 448×448 pixels, which lets the model accept inputs up to 4K resolution; and collection of a high-quality bilingual dataset of English and Chinese question-answer pairs covering common scenes and documents. Across 18 benchmarks the resulting model matches or exceeds both open-source and commercial models, with top scores on eight, especially on OCR and Chinese-language tasks.

Core claim

InternVL 1.5 shows that an open-source MLLM can achieve competitive or superior results against proprietary multimodal models by combining a continuously trained InternViT-6B vision encoder, dynamic high-resolution image tiling up to 4K, and a carefully curated high-quality bilingual dataset covering everyday scenes and documents.

What carries the argument

InternVL 1.5, which couples the InternViT-6B vision encoder with dynamic tiling of input images into a variable number of 448×448 patches and with bilingual question-answer supervision, to support higher-resolution and multilingual visual understanding.
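
Since the tiling rule is the most mechanical of the three changes, a sketch helps fix ideas. What follows is a minimal, hypothetical Python implementation of the behavior the abstract describes: enumerate tile grids containing 1 to 40 tiles, pick the grid whose aspect ratio best matches the input, resize, and crop 448×448 tiles. The grid enumeration, tie-breaking, and the real pipeline's handling of thumbnails are assumptions, not the released code.

    # Hypothetical sketch of dynamic high-resolution tiling (1-40 tiles of
    # 448x448), following the abstract's description; grid enumeration and
    # tie-breaking are assumptions, not the released InternVL code.
    from PIL import Image

    TILE = 448

    def candidate_grids(min_tiles=1, max_tiles=40):
        """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
        return sorted(
            {(c, r) for c in range(1, max_tiles + 1)
                    for r in range(1, max_tiles + 1)
                    if min_tiles <= c * r <= max_tiles},
            key=lambda g: g[0] * g[1],
        )

    def pick_grid(width, height):
        """Choose the grid whose aspect ratio is closest to the image's."""
        target = width / height
        return min(candidate_grids(), key=lambda g: abs(g[0] / g[1] - target))

    def tile_image(img):
        """Resize to the chosen grid, then crop it into 448x448 tiles."""
        cols, rows = pick_grid(*img.size)
        resized = img.resize((cols * TILE, rows * TILE))
        return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
                for r in range(rows) for c in range(cols)]

    if __name__ == "__main__":
        for w, h in [(448, 448), (1920, 1080), (3840, 2160), (1080, 2400)]:
            cols, rows = pick_grid(w, h)
            print(f"{w}x{h} -> {cols}x{rows} grid = {cols * rows} tiles")

Under these assumptions a square input stays a single tile, 16:9 inputs land on a 7×4 grid of 28 tiles, and a tall 1080×2400 screenshot gets a 4×9 grid of 36 tiles; the 40-tile cap is what bounds the supported input resolution near 4K.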

If this is right

  • Open-source multimodal models gain the ability to process high-resolution document and scene images without fixed resolution limits.
  • OCR accuracy and performance on Chinese-language visual tasks rise markedly from the bilingual training data.
  • Vision encoders can be trained once with continuous learning and then reused across multiple language models (see the sketch after this list).
  • The performance gap between open and closed multimodal systems narrows on standard evaluation suites.
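
The reuse point above has a standard concrete form. Here is a minimal, hypothetical sketch assuming a LLaVA-style arrangement in which one frozen vision encoder feeds different LLMs through small, LLM-specific projection layers; the tiny encoder stand-in, the dimensions, and the names are illustrative and are not the paper's architecture.

    # Hypothetical sketch: one frozen, shared vision encoder reused across
    # LLMs, with only a small projector trained per LLM. Dimensions are
    # illustrative, not InternViT-6B's.
    import torch
    import torch.nn as nn

    class VisionEncoder(nn.Module):
        """Tiny stand-in for a large ViT such as InternViT-6B."""
        def __init__(self, feat_dim: int = 3200):
            super().__init__()
            self.feat_dim = feat_dim
            self.patch_embed = nn.Conv2d(3, feat_dim, kernel_size=32, stride=32)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            patches = self.patch_embed(images)       # (B, feat_dim, 14, 14)
            return patches.flatten(2).mean(dim=-1)   # pooled (B, feat_dim)

    def make_projector(encoder: VisionEncoder, llm_hidden: int) -> nn.Module:
        """Freeze the shared encoder; only this projector is LLM-specific."""
        for p in encoder.parameters():
            p.requires_grad = False
        return nn.Linear(encoder.feat_dim, llm_hidden)

    encoder = VisionEncoder()
    proj_a = make_projector(encoder, llm_hidden=2048)  # e.g. a small chat LLM
    proj_b = make_projector(encoder, llm_hidden=6144)  # e.g. a 20B-class LLM

    images = torch.randn(2, 3, 448, 448)
    feats = encoder(images)          # shared visual features, computed once
    print(proj_a(feats).shape)       # torch.Size([2, 2048])
    print(proj_b(feats).shape)       # torch.Size([2, 6144])

The design choice being illustrated is the claimed economics: the expensive encoder is trained (or continuously trained) once, while adapting it to a new LLM costs only a projector.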

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Wider availability of capable open multimodal models could reduce dependence on commercial APIs for visual reasoning applications.
  • The dynamic tiling method offers a practical way to handle images of widely varying sizes and aspect ratios in future models.
  • Similar combinations of encoder scaling, resolution flexibility, and targeted data curation may extend to video or other modalities.

Load-bearing premise

The three improvements create genuine gains that generalize to new multimodal tasks rather than just improving scores on the particular benchmarks chosen for evaluation.

What would settle it

A new multimodal benchmark or real-world task set where InternVL 1.5 falls substantially behind the strongest proprietary models even after applying the same three improvements.

read the original abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces InternVL 1.5, an open-source multimodal large language model (MLLM) that incorporates three targeted improvements: (1) continuous learning to strengthen the InternViT-6B vision encoder for better transferability across LLMs, (2) dynamic high-resolution processing that adaptively tiles input images into 1–40 patches of 448×448 pixels (supporting up to 4K resolution based on aspect ratio), and (3) a curated high-quality bilingual (English–Chinese) QA dataset focused on everyday scenes and document images. The central claim is that these changes enable InternVL 1.5 to achieve competitive performance against both open-source and proprietary models, attaining state-of-the-art results on 8 of 18 multimodal benchmarks.

Significance. If the reported benchmark results prove robust under detailed scrutiny, the work would be significant as a practical demonstration that modest, reusable enhancements in vision encoding, resolution handling, and bilingual data curation can substantially narrow the gap between open-source and commercial multimodal systems. The public code release and the emphasis on a transferable vision foundation model provide concrete assets for the community.

major comments (3)
  1. [Abstract] The claim that InternVL 1.5 achieves state-of-the-art results on 8 of 18 benchmarks is presented without naming the benchmarks, reporting numerical scores, listing baselines, or referencing any results table or statistical test. This absence is load-bearing for the central performance claim and prevents verification of whether the three improvements genuinely close the gap to GPT-4V-scale models.
  2. [Evaluation] No ablation studies or controlled experiments are described that isolate the contribution of the stronger vision encoder, the dynamic tiling mechanism, or the bilingual dataset on held-out tasks. Without such controls it remains unclear whether the reported gains reflect genuine generalization or benchmark-specific tuning.
  3. [Comparisons] The manuscript provides no explicit statement that proprietary baselines (GPT-4V, etc.) were evaluated under identical conditions—same prompts, decoding parameters, and image-resolution handling—as the proposed dynamic-tiling pipeline. Inconsistent protocols would undermine the claim of competitive or superior performance.
minor comments (2)
  1. [Abstract] The description of the dynamic tiling strategy (1–40 tiles of 448×448) would be clearer with an accompanying figure showing examples for different aspect ratios and resolutions.
  2. [Evaluation] Ensure every benchmark cited in the results is accompanied by its original reference and a brief description of the metric used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The claim that InternVL 1.5 achieves state-of-the-art results on 8 of 18 benchmarks is presented without naming the benchmarks, reporting numerical scores, listing baselines, or referencing any results table or statistical test. This absence is load-bearing for the central performance claim and prevents verification of whether the three improvements genuinely close the gap to GPT-4V-scale models.

    Authors: We agree that the abstract is too concise to fully support the central claim. In the revised manuscript, we will expand the abstract to name the specific benchmarks achieving SOTA results, include key numerical scores and main baselines, and explicitly reference the primary results table for full verification. revision: yes

  2. Referee: [Evaluation] No ablation studies or controlled experiments are described that isolate the contribution of the stronger vision encoder, the dynamic tiling mechanism, or the bilingual dataset on held-out tasks. Without such controls it remains unclear whether the reported gains reflect genuine generalization or benchmark-specific tuning.

    Authors: We acknowledge that dedicated ablations would better isolate each component's contribution. The submitted manuscript emphasizes overall benchmark comparisons rather than component-wise controls. In the revision, we will add a new subsection with controlled ablation experiments evaluating the impact of the stronger vision encoder, dynamic high-resolution tiling, and bilingual dataset on held-out tasks. revision: yes

  3. Referee: [Comparisons] The manuscript provides no explicit statement that proprietary baselines (GPT-4V, etc.) were evaluated under identical conditions—same prompts, decoding parameters, and image-resolution handling—as the proposed dynamic-tiling pipeline. Inconsistent protocols would undermine the claim of competitive or superior performance.

    Authors: We thank the referee for highlighting this methodological point. For open-source models we performed evaluations under consistent settings; for proprietary models we used officially published results, as identical re-evaluation is constrained by API access and policies. We will add an explicit paragraph in the Comparisons section detailing the protocols for each baseline type, including any differences in prompting or resolution handling, to ensure full transparency. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark claims rest on external evaluations independent of model definitions.

full rationale

The paper describes three concrete improvements (continuous learning on InternViT-6B, dynamic 1-40 tile high-res up to 4K, and new bilingual QA dataset) and reports measured performance on 18 external benchmarks, with SOTA on 8. No equations, derivations, or self-referential definitions exist. Performance numbers are not fitted parameters renamed as predictions, nor do they reduce by construction to the inputs. Self-citations to prior InternVL work are present but not load-bearing; the central claims are directly falsifiable via the cited benchmarks under standard protocols. This is a standard empirical release paper whose results do not collapse into definitional equivalence with its own training choices.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard machine-learning assumptions about scaling vision encoders and the value of curated data; no new free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (2)
  • domain assumption Continuous learning on a large vision foundation model improves transferable visual understanding.
    Invoked to justify the first improvement.
  • domain assumption Dynamic tiling of images into 448×448 patches preserves information for high-resolution inputs.
    Invoked to justify the second improvement; a toy check after this list probes it.
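
The second axiom is easy to probe with a toy check of our own, not anything from the paper: once an image has been resized so its sides are multiples of 448, splitting it into tiles and stitching them back is exactly lossless, so whatever information the tiling pipeline loses is confined to the resize step.

    # Toy check: splitting an array whose sides are multiples of 448 into
    # tiles and stitching them back reproduces it exactly (lossless).
    import numpy as np

    TILE = 448

    def split(img):
        """Split an (H, W, C) array, H and W multiples of TILE, into tiles."""
        rows, cols = img.shape[0] // TILE, img.shape[1] // TILE
        tiles = [img[r*TILE:(r+1)*TILE, c*TILE:(c+1)*TILE]
                 for r in range(rows) for c in range(cols)]
        return tiles, (rows, cols)

    def stitch(tiles, grid):
        """Reassemble row-major tiles back into the full image."""
        rows, cols = grid
        return np.concatenate(
            [np.concatenate(tiles[r*cols:(r+1)*cols], axis=1)
             for r in range(rows)], axis=0)

    img = np.random.randint(0, 256, size=(2*TILE, 5*TILE, 3), dtype=np.uint8)
    tiles, grid = split(img)
    assert len(tiles) == 10
    assert np.array_equal(stitch(tiles, grid), img)
    print("tiling and reassembly are exact; only the resize can lose detail")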

pith-pipeline@v0.9.0 · 5658 in / 1341 out tokens · 98750 ms · 2026-05-12T20:53:52.362772+00:00 · methodology

discussion (0)


Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

    cs.NE 2026-04 unverdicted novelty 8.0

    SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

  2. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  3. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  4. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  5. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  6. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  7. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  8. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  9. Language-Conditioned Visual Grounding with CLIP Multilingual

    cs.CL 2026-05 unverdicted novelty 6.0

    Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.

  10. SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

    cs.AI 2026-05 unverdicted novelty 6.0

    SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop rea...

  11. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  12. PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.

  13. Latent Denoising Improves Visual Alignment in Large Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

  14. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  15. ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

    cs.CV 2026-03 unverdicted novelty 6.0

    ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.

  16. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  17. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  18. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  19. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  20. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  21. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  22. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  23. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    cs.CL 2024-04 accept novelty 6.0

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  24. Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    cs.CV 2026-05 unverdicted novelty 5.0

    ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.

  25. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  26. Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

    cs.CV 2026-04 unverdicted novelty 5.0

    An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.

  27. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  28. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  29. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  30. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
