A Survey on Multimodal Large Language Models
Pith reviewed 2026-05-16 02:48 UTC · model grok-4.3
The pith
Multimodal large language models use LLMs as a central brain to handle images and other inputs, exhibiting new emergent reasoning skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that multimodal large language models (MLLMs), represented by GPT-4V, use powerful large language models as a brain to perform multimodal tasks and display surprising emergent capabilities, such as writing stories based on images and OCR-free math reasoning, that are rare in traditional multimodal methods. It also summarizes their formulation, architecture, training strategy and data, and evaluation; extensions to finer granularity, more modalities, languages, and scenarios; multimodal hallucination; extended techniques including multimodal in-context learning (M-ICL), multimodal chain-of-thought (M-CoT), and LLM-aided visual reasoning (LAVR); and open challenges and promising directions.
What carries the argument
The central object is the large language model used as a unifying brain to process and reason over combined multimodal inputs through shared architectures and joint training.
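To make the "LLM as a unifying brain" pattern concrete, the sketch below shows the vision-encoder, projector, LLM-backbone layout common to the surveyed models. It is a minimal toy under stated assumptions: the class name, dimensions, and stand-in encoder are illustrative and not taken from any specific model.

```python
# Minimal sketch of the common MLLM layout: vision encoder -> projector -> LLM.
# All sizes and module choices are illustrative assumptions, not any real model.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vis_dim=256, llm_dim=512, vocab=1000):
        super().__init__()
        # Stand-in for a (usually frozen) vision encoder such as a CLIP ViT:
        # patchify the image into a grid of visual features.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, vis_dim, kernel_size=16, stride=16),
            nn.Flatten(2),  # (B, vis_dim, num_patches)
        )
        # Trainable connector mapping visual features into the LLM's
        # token-embedding space; this is the "projector" the survey describes.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the LLM backbone: token embeddings + one transformer layer.
        self.tok_emb = nn.Embedding(vocab, llm_dim)
        self.backbone = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image, token_ids):
        vis = self.vision_encoder(image).transpose(1, 2)   # (B, N, vis_dim)
        vis_tokens = self.projector(vis)                   # align to LLM space
        txt_tokens = self.tok_emb(token_ids)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)   # one shared sequence
        return self.lm_head(self.backbone(seq))            # next-token logits

model = ToyMLLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 204, 1000]): 196 visual + 8 text positions
```

The design point the survey emphasizes is the single concatenated sequence: it lets the language model attend jointly over visual and textual tokens, which is what enables cross-modal reasoning.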
If this is right
- MLLMs can be extended to support finer granularity, additional modalities, more languages, and complex scenarios.
- Techniques such as multimodal in-context learning, multimodal chain-of-thought reasoning, and LLM-aided visual reasoning improve performance on multimodal tasks (a prompt-assembly sketch follows this list).
- Tackling multimodal hallucination is required for dependable real-world applications.
- Continued progress in this area may open a route toward artificial general intelligence.
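As a concrete illustration of the first of these techniques, here is a minimal sketch of how a multimodal in-context-learning prompt is typically assembled: a few image-text demonstrations precede the query so the model can imitate the demonstrated task at inference time. The <image> placeholder convention and the helper below are assumptions for illustration, not any particular model's interface.

```python
# Minimal sketch of assembling a multimodal in-context-learning (M-ICL) prompt.
# The <image> placeholder and this helper are illustrative assumptions; real
# models pair each placeholder with an actual image at inference time.
def build_micl_prompt(demos, query_instruction):
    """demos: list of (instruction, answer) pairs, each paired with one image."""
    parts = [f"<image>\n{instruction}\n{answer}" for instruction, answer in demos]
    parts.append(f"<image>\n{query_instruction}")  # the model completes the answer
    return "\n\n".join(parts)

prompt = build_micl_prompt(
    demos=[
        ("Describe the image.", "A dog catching a frisbee in a park."),
        ("Describe the image.", "A red bicycle leaning against a brick wall."),
    ],
    query_instruction="Describe the image.",
)
print(prompt)
```

Multimodal chain-of-thought prompting works analogously, except the demonstrations also include intermediate reasoning steps before each answer.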
Where Pith is reading between the lines
- Unified LLM-centered models may replace earlier separate-modality approaches in many vision-language settings.
- Adding real-time video or audio streams could test whether current emergent skills scale to continuous inputs.
- The linked repository underscores the value of living resources for tracking fast-changing research areas.
Load-bearing premise
The survey assumes that the cited literature and the associated GitHub repository together provide a sufficiently complete and up-to-date picture of the rapidly evolving MLLM field.
What would settle it
A new review identifying many important recent MLLM papers or key developments absent from this survey and its linked repository would show the summary is incomplete.
Original abstract
Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLM, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even better than GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey tracing recent progress on Multimodal Large Language Models (MLLMs). It begins with the basic formulation and related concepts of architecture, training strategy, data, and evaluation. It then covers extensions supporting greater granularity, additional modalities, languages, and scenarios, followed by multimodal hallucination and techniques including Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). The survey concludes with challenges, promising directions, and an associated GitHub repository for updates.
Significance. If the coverage proves comprehensive, the survey supplies a useful organizational framework for the fast-moving MLLM field, explicitly crediting emergent capabilities such as image-based story writing and OCR-free math reasoning while pointing to an open GitHub repository that collects the latest papers. This combination of structured delineation and a living resource strengthens its value as a reference for researchers working on vision-language integration.
major comments (2)
- [Evaluation] The evaluation section does not quantify how well current benchmarks capture the emergent capabilities highlighted in the abstract (e.g., story writing from images); without such analysis the contrast with traditional multimodal methods remains qualitative and weakens the motivation for the survey's scope.
- [Training and Data] In the training and data section, the discussion of data curation omits explicit comparison of scale, filtering, and alignment procedures across representative models (LLaVA, MiniGPT-4, etc.), which is load-bearing for readers seeking to reproduce or extend the reported performance trends.
minor comments (3)
- [Abstract] The abstract repeats motivational phrasing about AGI that could be shortened without loss of clarity.
- [Architecture] Figure captions for architecture diagrams should explicitly label each component (vision encoder, projector, LLM backbone) to match the textual description.
- [Introduction] The GitHub repository is mentioned only in the abstract; a short dedicated paragraph in the introduction describing its maintenance policy and coverage criteria would improve usability.
Simulated Author's Rebuttal
We thank the referee for the encouraging assessment and the specific comments, which help clarify areas where the survey can be strengthened. We address each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Evaluation] The evaluation section does not quantify how well current benchmarks capture the emergent capabilities highlighted in the abstract (e.g., story writing from images); without such analysis the contrast with traditional multimodal methods remains qualitative and weakens the motivation for the survey's scope.
  Authors: We acknowledge that the evaluation section primarily summarizes existing benchmarks and notes emergent capabilities without providing quantitative metrics on benchmark coverage. As this is a survey, we do not introduce new empirical evaluations; however, we will expand the section with a dedicated paragraph discussing the limitations of current benchmarks in capturing capabilities such as image-based story writing and OCR-free reasoning, referencing any available meta-analyses or studies that quantify these gaps. This addition will make the contrast with traditional methods more explicit while remaining within the survey's scope. revision: partial
- Referee: [Training and Data] In the training and data section, the discussion of data curation omits explicit comparison of scale, filtering, and alignment procedures across representative models (LLaVA, MiniGPT-4, etc.), which is load-bearing for readers seeking to reproduce or extend the reported performance trends.
  Authors: We agree that a side-by-side comparison would improve utility for readers. We will insert a new table in the training and data section that explicitly compares data scale, filtering strategies, and alignment procedures for representative models including LLaVA, MiniGPT-4, and others, based on details reported in their original papers. This table will directly address reproducibility needs. revision: yes
Circularity Check
No significant circularity; descriptive survey of external literature
full rationale
This paper is a literature survey with no original derivations, equations, quantitative predictions, or first-principles results. Its contribution is organizational: delineating architectures, training strategies, data, evaluations, extensions, hallucination, and techniques like M-ICL and M-CoT drawn from cited external works. The abstract's reference to emergent capabilities is presented as motivation from prior examples rather than a derived claim. No self-citations function as load-bearing justifications for novel results, and no steps reduce to fitted inputs or self-definitions by construction.
Forward citations
Cited by 22 Pith papers
- Cross-Modal Backdoors in Multimodal Large Language Models
  Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
  MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
- ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
  ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
- EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
  EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.
- CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
  Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
- When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
  Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
- LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
  LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
- OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
  OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
- EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
  EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradien...
- MMaDA: Multimodal Large Diffusion Language Models
  MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
  ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
- Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
  Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
- SALLIE: Safeguarding Against Latent Language & Image Exploits
  SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
- A Survey on the Memory Mechanism of Large Language Model based Agents
  A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
- A Survey on Hallucination in Large Vision-Language Models
  This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.