How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Pith reviewed 2026-05-12 20:53 UTC · model grok-4.3
The pith
InternVL 1.5 reaches state-of-the-art on 8 of 18 multimodal benchmarks by strengthening its vision encoder, adding dynamic high-resolution tiling, and using a new bilingual dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternVL 1.5 shows that an open-source MLLM can achieve competitive or superior results against proprietary multimodal models by combining a continuously trained InternViT-6B vision encoder, dynamic high-resolution image tiling up to 4K, and a carefully curated high-quality bilingual dataset covering everyday scenes and documents.
What carries the argument
InternVL 1.5, which couples the InternViT-6B vision encoder with dynamic tiling of input images into a variable number of 448×448 tiles, plus bilingual question-answer supervision, to support higher-resolution and multilingual visual understanding.
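To make the tiling mechanism concrete, here is a minimal Python sketch of how aspect-ratio-matched dynamic tiling could work, assuming the model enumerates candidate grids of up to 40 tiles and picks the grid whose aspect ratio best matches the input; the function name and tie-breaking rule are illustrative assumptions, not the paper's exact algorithm.

```python
def choose_tile_grid(width, height, tile=448, max_tiles=40):
    """Pick a (cols, rows) grid of 448x448 tiles for an input image.

    Hypothetical reconstruction: enumerate every grid with at most
    `max_tiles` tiles and minimize the aspect-ratio mismatch, preferring
    larger grids on ties so high-resolution inputs keep more detail.
    """
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):  # cols * rows <= max_tiles
            err = abs(cols / rows - target)
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best

# A 4K landscape frame (3840x2160, 16:9) maps to a wide grid; the image
# would then be resized to cols*448 x rows*448 and cut into that many tiles.
cols, rows = choose_tile_grid(3840, 2160)
print(cols, rows, cols * rows)  # a grid whose cols/rows ratio approximates 16/9
```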
If this is right
- Open-source multimodal models gain the ability to process high-resolution document and scene images without fixed resolution limits.
- OCR accuracy and performance on Chinese-language visual tasks rise markedly thanks to the bilingual training data.
- Vision encoders can be trained once with continuous learning and then reused across multiple language models.
- The performance gap between open and closed multimodal systems narrows on standard evaluation suites.
Where Pith is reading between the lines
- Wider availability of capable open multimodal models could reduce dependence on commercial APIs for visual reasoning applications.
- The dynamic tiling method offers a practical way to handle images of widely varying sizes and aspect ratios in future models.
- Similar combinations of encoder scaling, resolution flexibility, and targeted data curation may extend to video or other modalities.
Load-bearing premise
The three improvements create genuine gains that generalize to new multimodal tasks rather than just improving scores on the particular benchmarks chosen for evaluation.
What would settle it
A new multimodal benchmark or real-world task set where InternVL 1.5 falls substantially behind the strongest proprietary models even after applying the same three improvements.
Original abstract
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVL 1.5, an open-source multimodal large language model (MLLM) that incorporates three targeted improvements: (1) continuous learning to strengthen the InternViT-6B vision encoder for better transferability across LLMs, (2) dynamic high-resolution processing that adaptively tiles input images into 1–40 tiles of 448×448 pixels, chosen by aspect ratio and resolution and supporting up to 4K input, and (3) a curated high-quality bilingual (English–Chinese) QA dataset focused on everyday scenes and document images. The central claim is that these changes enable InternVL 1.5 to achieve competitive performance against both open-source and proprietary models, attaining state-of-the-art results on 8 of 18 multimodal benchmarks.
Significance. If the reported benchmark results prove robust under detailed scrutiny, the work would be significant as a practical demonstration that modest, reusable enhancements in vision encoding, resolution handling, and bilingual data curation can substantially narrow the gap between open-source and commercial multimodal systems. The public code release and the emphasis on a transferable vision foundation model provide concrete assets for the community.
major comments (3)
- [Abstract] The claim that InternVL 1.5 achieves state-of-the-art results on 8 of 18 benchmarks is presented without naming the benchmarks, reporting numerical scores, listing baselines, or referencing any results table or statistical test. This absence is load-bearing for the central performance claim and prevents verification of whether the three improvements genuinely close the gap to GPT-4V-scale models.
- [Evaluation] No ablation studies or controlled experiments are described that isolate the contribution of the stronger vision encoder, the dynamic tiling mechanism, or the bilingual dataset on held-out tasks. Without such controls it remains unclear whether the reported gains reflect genuine generalization or benchmark-specific tuning.
- [Comparisons] The manuscript provides no explicit statement that proprietary baselines (GPT-4V, etc.) were evaluated under identical conditions, with the same prompts, decoding parameters, and image-resolution handling as the proposed dynamic-tiling pipeline. Inconsistent protocols would undermine the claim of competitive or superior performance (a matched-protocol sketch follows this list).
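For illustration, a matched-protocol evaluation could look like the following minimal sketch, where every model, open or proprietary, receives the same prompt template, decoding parameters, and images; the callable interface and dataset format are assumptions for exposition, not an actual API.

```python
# Hypothetical matched-protocol harness: identical prompts, decoding
# parameters, and preprocessing for every model, so score differences
# reflect the models rather than the evaluation setup.
PROMPT = "Answer the question about the image.\nQuestion: {q}\nAnswer:"
DECODING = {"temperature": 0.0, "max_new_tokens": 64}

def evaluate(model, dataset):
    """`model` is any callable (image, prompt, **decoding) -> str;
    `dataset` yields (image, question, answer) triples."""
    correct = 0
    for image, question, answer in dataset:
        pred = model(image, PROMPT.format(q=question), **DECODING)
        correct += pred.strip().lower() == answer.strip().lower()
    return correct / len(dataset)
```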
minor comments (2)
- [Abstract] The description of the dynamic tiling strategy (1–40 tiles of 448×448) would be clearer with an accompanying figure showing examples for different aspect ratios and resolutions.
- [Evaluation] Ensure every benchmark cited in the results is accompanied by its original reference and a brief description of the metric used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that InternVL 1.5 achieves state-of-the-art results on 8 of 18 benchmarks is presented without naming the benchmarks, reporting numerical scores, listing baselines, or referencing any results table or statistical test. This absence is load-bearing for the central performance claim and prevents verification of whether the three improvements genuinely close the gap to GPT-4V-scale models.
  Authors: We agree that the abstract is too concise to fully support the central claim. In the revised manuscript, we will expand the abstract to name the specific benchmarks achieving SOTA results, include key numerical scores and main baselines, and explicitly reference the primary results table for full verification. revision: yes
- Referee: [Evaluation] No ablation studies or controlled experiments are described that isolate the contribution of the stronger vision encoder, the dynamic tiling mechanism, or the bilingual dataset on held-out tasks. Without such controls it remains unclear whether the reported gains reflect genuine generalization or benchmark-specific tuning.
  Authors: We acknowledge that dedicated ablations would better isolate each component's contribution. The submitted manuscript emphasizes overall benchmark comparisons rather than component-wise controls. In the revision, we will add a new subsection with controlled ablation experiments evaluating the impact of the stronger vision encoder, dynamic high-resolution tiling, and the bilingual dataset on held-out tasks (one possible design is sketched after this list). revision: yes
- Referee: [Comparisons] The manuscript provides no explicit statement that proprietary baselines (GPT-4V, etc.) were evaluated under identical conditions, with the same prompts, decoding parameters, and image-resolution handling as the proposed dynamic-tiling pipeline. Inconsistent protocols would undermine the claim of competitive or superior performance.
  Authors: We thank the referee for highlighting this methodological point. For open-source models we performed evaluations under consistent settings; for proprietary models we used officially published results, as identical re-evaluation is constrained by API access and policies. We will add an explicit paragraph in the Comparisons section detailing the protocols for each baseline type, including any differences in prompting or resolution handling, to ensure full transparency. revision: yes
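As a sketch of what the promised component-wise controls could look like, the grid below toggles each of the three improvements independently and evaluates every variant on held-out suites; the component names and benchmark list are illustrative, and `train_variant` / `evaluate` stand in for the real training and scoring pipeline.

```python
from itertools import product

COMPONENTS = ["strong_encoder", "dynamic_tiling", "bilingual_data"]
HELD_OUT = ["DocVQA", "ChartQA", "MMBench-CN"]  # illustrative held-out suites

def ablation_grid():
    """Yield all 2^3 on/off configurations of the three improvements."""
    for flags in product([False, True], repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, flags))

for config in ablation_grid():
    # model = train_variant(**config)                 # placeholder training call
    # scores = {b: evaluate(model, b) for b in HELD_OUT}
    print(config)  # eight rows, from all-off baseline to full InternVL 1.5
```

Comparing adjacent rows of such a grid attributes each benchmark gain to a single component rather than to the bundle.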
Circularity Check
No circularity; empirical benchmark claims rest on external evaluations independent of model definitions.
full rationale
The paper describes three concrete improvements (continuous learning on InternViT-6B, dynamic 1–40-tile high resolution up to 4K, and a new bilingual QA dataset) and reports measured performance on 18 external benchmarks, with SOTA on 8. No equations, derivations, or self-referential definitions exist. Performance numbers are not fitted parameters renamed as predictions, nor do they reduce by construction to the inputs. Self-citations to prior InternVL work are present but not load-bearing; the central claims are directly falsifiable via the cited benchmarks under standard protocols. This is a standard empirical release paper whose results do not collapse into definitional equivalence with its own training choices.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Continuous learning on a large vision foundation model improves transferable visual understanding.
- domain assumption: Dynamic tiling of images into 448×448 patches preserves information for high-resolution inputs.
Forward citations
Cited by 30 Pith papers
- SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
  SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
  Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
  GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
- SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
  SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.
- EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
  EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
- Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
  Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- Language-Conditioned Visual Grounding with CLIP Multilingual
  Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.
- SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs
  SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop rea...
- CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
  CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
- PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
  PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
  A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models
  BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
- ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
  ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
- DeepSeek-OCR: Contexts Optical Compression
  DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
  ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
- Make Your LVLM KV Cache More Lightweight
  LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
- Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
  An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.