How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Pith reviewed 2026-05-12 20:53 UTC · model grok-4.3
The pith
InternVL 1.5 reaches state-of-the-art on 8 of 18 multimodal benchmarks by strengthening its vision encoder, adding dynamic high-resolution tiling, and using a new bilingual dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternVL 1.5 shows that an open-source MLLM can achieve competitive or superior results against proprietary multimodal models by combining a continuously trained InternViT-6B vision encoder, dynamic high-resolution image tiling up to 4K, and a carefully curated high-quality bilingual dataset covering everyday scenes and documents.
What carries the argument
InternVL 1.5, which couples the InternViT-6B vision encoder with dynamic tiling of input images into a variable number of 448×448 tiles, plus bilingual question-answer supervision, to support higher-resolution and multilingual visual understanding.
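To make the tiling mechanism concrete, here is a minimal Python sketch of how aspect-ratio-matched dynamic tiling could work, assuming the model enumerates candidate grids of up to 40 tiles and picks the grid whose aspect ratio best matches the input; the function name and tie-breaking rule are illustrative assumptions, not the paper's exact algorithm.

```python
def choose_tile_grid(width, height, tile=448, max_tiles=40):
    """Pick a (cols, rows) grid of 448x448 tiles for an input image.

    Hypothetical reconstruction: enumerate every grid with at most
    `max_tiles` tiles and minimize the aspect-ratio mismatch, preferring
    larger grids on ties so high-resolution inputs keep more detail.
    """
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):  # cols * rows <= max_tiles
            err = abs(cols / rows - target)
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best

# A 4K landscape frame (3840x2160, 16:9) maps to a wide grid; the image
# would then be resized to cols*448 x rows*448 and cut into that many tiles.
cols, rows = choose_tile_grid(3840, 2160)
print(cols, rows, cols * rows)  # a grid whose cols/rows ratio approximates 16/9
```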
If this is right
- Open-source multimodal models gain the ability to process high-resolution document and scene images without fixed resolution limits.
- OCR accuracy and performance on Chinese-language visual tasks rise markedly thanks to the bilingual training data.
- Vision encoders can be trained once with continuous learning and then reused across multiple language models.
- The performance gap between open and closed multimodal systems narrows on standard evaluation suites.
Where Pith is reading between the lines
- Wider availability of capable open multimodal models could reduce dependence on commercial APIs for visual reasoning applications.
- The dynamic tiling method offers a practical way to handle images of widely varying sizes and aspect ratios in future models.
- Similar combinations of encoder scaling, resolution flexibility, and targeted data curation may extend to video or other modalities.
Load-bearing premise
The three improvements create genuine gains that generalize to new multimodal tasks rather than just improving scores on the particular benchmarks chosen for evaluation.
What would settle it
A new multimodal benchmark or real-world task set where InternVL 1.5 falls substantially behind the strongest proprietary models even after applying the same three improvements.
Original abstract
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVL 1.5, an open-source multimodal large language model (MLLM) that incorporates three targeted improvements: (1) continuous learning to strengthen the InternViT-6B vision encoder for better transferability across LLMs, (2) dynamic high-resolution processing that adaptively tiles input images into 1–40 tiles of 448×448 pixels, chosen by aspect ratio and resolution and supporting up to 4K input, and (3) a curated high-quality bilingual (English–Chinese) QA dataset focused on everyday scenes and document images. The central claim is that these changes enable InternVL 1.5 to achieve competitive performance against both open-source and proprietary models, attaining state-of-the-art results on 8 of 18 multimodal benchmarks.
Significance. If the reported benchmark results prove robust under detailed scrutiny, the work would be significant as a practical demonstration that modest, reusable enhancements in vision encoding, resolution handling, and bilingual data curation can substantially narrow the gap between open-source and commercial multimodal systems. The public code release and the emphasis on a transferable vision foundation model provide concrete assets for the community.
major comments (3)
- [Abstract] The claim that InternVL 1.5 achieves state-of-the-art results on 8 of 18 benchmarks is presented without naming the benchmarks, reporting numerical scores, listing baselines, or referencing any results table or statistical test. This absence is load-bearing for the central performance claim and prevents verification of whether the three improvements genuinely close the gap to GPT-4V-scale models.
- [Evaluation] No ablation studies or controlled experiments are described that isolate the contribution of the stronger vision encoder, the dynamic tiling mechanism, or the bilingual dataset on held-out tasks. Without such controls it remains unclear whether the reported gains reflect genuine generalization or benchmark-specific tuning.
- [Comparisons] The manuscript provides no explicit statement that proprietary baselines (GPT-4V, etc.) were evaluated under identical conditions, with the same prompts, decoding parameters, and image-resolution handling as the proposed dynamic-tiling pipeline. Inconsistent protocols would undermine the claim of competitive or superior performance (a matched-protocol sketch follows this list).
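For illustration, a matched-protocol evaluation could look like the following minimal sketch, where every model, open or proprietary, receives the same prompt template, decoding parameters, and images; the callable interface and dataset format are assumptions for exposition, not an actual API.

```python
# Hypothetical matched-protocol harness: identical prompts, decoding
# parameters, and preprocessing for every model, so score differences
# reflect the models rather than the evaluation setup.
PROMPT = "Answer the question about the image.\nQuestion: {q}\nAnswer:"
DECODING = {"temperature": 0.0, "max_new_tokens": 64}

def evaluate(model, dataset):
    """`model` is any callable (image, prompt, **decoding) -> str;
    `dataset` yields (image, question, answer) triples."""
    correct = 0
    for image, question, answer in dataset:
        pred = model(image, PROMPT.format(q=question), **DECODING)
        correct += pred.strip().lower() == answer.strip().lower()
    return correct / len(dataset)
```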
minor comments (2)
- [Abstract] The description of the dynamic tiling strategy (1–40 tiles of 448×448) would be clearer with an accompanying figure showing examples for different aspect ratios and resolutions.
- [Evaluation] Ensure every benchmark cited in the results is accompanied by its original reference and a brief description of the metric used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that InternVL 1.5 achieves state-of-the-art results on 8 of 18 benchmarks is presented without naming the benchmarks, reporting numerical scores, listing baselines, or referencing any results table or statistical test. This absence is load-bearing for the central performance claim and prevents verification of whether the three improvements genuinely close the gap to GPT-4V-scale models.
  Authors: We agree that the abstract is too concise to fully support the central claim. In the revised manuscript, we will expand the abstract to name the specific benchmarks achieving SOTA results, include key numerical scores and main baselines, and explicitly reference the primary results table for full verification. revision: yes
- Referee: [Evaluation] No ablation studies or controlled experiments are described that isolate the contribution of the stronger vision encoder, the dynamic tiling mechanism, or the bilingual dataset on held-out tasks. Without such controls it remains unclear whether the reported gains reflect genuine generalization or benchmark-specific tuning.
  Authors: We acknowledge that dedicated ablations would better isolate each component's contribution. The submitted manuscript emphasizes overall benchmark comparisons rather than component-wise controls. In the revision, we will add a new subsection with controlled ablation experiments evaluating the impact of the stronger vision encoder, dynamic high-resolution tiling, and the bilingual dataset on held-out tasks (one possible design is sketched after this list). revision: yes
- Referee: [Comparisons] The manuscript provides no explicit statement that proprietary baselines (GPT-4V, etc.) were evaluated under identical conditions, with the same prompts, decoding parameters, and image-resolution handling as the proposed dynamic-tiling pipeline. Inconsistent protocols would undermine the claim of competitive or superior performance.
  Authors: We thank the referee for highlighting this methodological point. For open-source models we performed evaluations under consistent settings; for proprietary models we used officially published results, as identical re-evaluation is constrained by API access and policies. We will add an explicit paragraph in the Comparisons section detailing the protocols for each baseline type, including any differences in prompting or resolution handling, to ensure full transparency. revision: yes
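As a sketch of what the promised component-wise controls could look like, the grid below toggles each of the three improvements independently and evaluates every variant on held-out suites; the component names and benchmark list are illustrative, and `train_variant` / `evaluate` stand in for the real training and scoring pipeline.

```python
from itertools import product

COMPONENTS = ["strong_encoder", "dynamic_tiling", "bilingual_data"]
HELD_OUT = ["DocVQA", "ChartQA", "MMBench-CN"]  # illustrative held-out suites

def ablation_grid():
    """Yield all 2^3 on/off configurations of the three improvements."""
    for flags in product([False, True], repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, flags))

for config in ablation_grid():
    # model = train_variant(**config)                 # placeholder training call
    # scores = {b: evaluate(model, b) for b in HELD_OUT}
    print(config)  # eight rows, from all-off baseline to full InternVL 1.5
```

Comparing adjacent rows of such a grid attributes each benchmark gain to a single component rather than to the bundle.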
Circularity Check
No circularity; empirical benchmark claims rest on external evaluations independent of model definitions.
full rationale
The paper describes three concrete improvements (continuous learning on InternViT-6B, dynamic 1–40-tile high resolution up to 4K, and a new bilingual QA dataset) and reports measured performance on 18 external benchmarks, with SOTA on 8. No equations, derivations, or self-referential definitions exist. Performance numbers are not fitted parameters renamed as predictions, nor do they reduce by construction to the inputs. Self-citations to prior InternVL work are present but not load-bearing; the central claims are directly falsifiable via the cited benchmarks under standard protocols. This is a standard empirical release paper whose results do not collapse into definitional equivalence with its own training choices.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Continuous learning on a large vision foundation model improves transferable visual understanding.
- domain assumption: Dynamic tiling of images into 448×448 patches preserves information for high-resolution inputs.
Forward citations
Cited by 30 Pith papers
- SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
  SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
  Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
  GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
- SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
  SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.
- EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
  EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
- Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
  Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- Language-Conditioned Visual Grounding with CLIP Multilingual
  Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.
- SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs
  SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop rea...
- CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
  CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
- PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
  PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
  A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models
  BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
- ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
  ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
- DeepSeek-OCR: Contexts Optical Compression
  DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
  ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
- Make Your LVLM KV Cache More Lightweight
  LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
- Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
  An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.