
arxiv: 2311.03079 · v2 · pith:CTVKNTPZnew · submitted 2023-11-06 · 💻 cs.CV

CogVLM: Visual Expert for Pretrained Language Models

Pith reviewed 2026-05-15 15:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords CogVLM · visual expert module · deep fusion · vision language model · frozen language model · multimodal benchmarks · cross-modal tasks · pretrained models

The pith

A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CogVLM adds a trainable visual expert module directly into the attention and feed-forward layers of a frozen pretrained language model. Unlike shallow alignment methods, which only map image features into the language model's input space, this design lets vision and language features interact throughout the network. The resulting 17-billion-parameter model reports state-of-the-art results on ten cross-modal benchmarks, including NoCaps, Flickr30k captioning, the RefCOCO series, GQA, ScienceQA, and VizWiz, while preserving performance on pure language tasks. It matches or exceeds much larger models such as PaLI-X 55B on these tasks. The approach requires no changes to the original language model architecture or loss functions.

Core claim

CogVLM shows that inserting a trainable visual expert module into the attention and FFN layers of any frozen pretrained language model bridges the gap between vision and language representations. This produces deep feature fusion across modalities without sacrificing the language model's original capabilities or requiring architectural modifications. The 17B CogVLM model then delivers state-of-the-art performance on ten classic cross-modal benchmarks including NoCaps, Flickr30k, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA and COCO captioning, matching or surpassing the 55B PaLI-X model.

What carries the argument

The visual expert module, a trainable component inserted into the attention and FFN layers of the frozen language model to enable deep vision-language feature fusion.
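The routing idea can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the released implementation: it assumes a single attention head, a causal mask, tiny hand-written weight matrices, and a boolean `is_image` flag per token. The names `routed_project` and `expert_attention` are hypothetical. Image tokens are projected with the trainable visual-expert weights, text tokens with the frozen language-model weights, and all tokens then share one attention computation. The same routing could be applied to the FFN.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def routed_project(hidden, is_image, W_text, W_image):
    """Project each token with the frozen text weights or the trainable
    visual-expert weights, depending on the token's modality."""
    return [matvec(W_image if img else W_text, h) for h, img in zip(hidden, is_image)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expert_attention(hidden, is_image, text_w, image_w):
    """Single-head causal attention with modality-routed Q, K, V projections."""
    q = routed_project(hidden, is_image, text_w["q"], image_w["q"])
    k = routed_project(hidden, is_image, text_w["k"], image_w["k"])
    v = routed_project(hidden, is_image, text_w["v"], image_w["v"])
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Causal mask (assumed): token i attends to positions 0..i only.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Toy weights for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_w = {"q": identity, "k": identity, "v": identity}   # frozen LM weights
image_w = {"q": doubled, "k": doubled, "v": doubled}     # trainable expert weights
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mixed = expert_attention(hidden, [True, True, False], text_w, image_w)
text_only = expert_attention(hidden, [False, False, False], text_w, image_w)
baseline = expert_attention(hidden, [False, False, False], text_w, text_w)
```

In this sketch a text-only sequence never touches the expert weights, so `text_only` equals `baseline`. That is the mechanism behind the claim that language-only behavior is preserved, provided the frozen weights are never updated.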

If this is right

  • The 17B model achieves state-of-the-art results on ten cross-modal benchmarks including NoCaps, Flickr30k captioning, RefCOCO series, GQA, ScienceQA and VizWiz VQA.
  • It ranks second on VQAv2, OKVQA, TextVQA and COCO captioning while matching or exceeding the 55B PaLI-X model.
  • Full performance on pure NLP tasks is retained with no architectural changes to the base language model.
  • Code and checkpoints are released as open source.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The insertion technique may apply to other frozen pretrained models beyond the specific base used in the experiments.
  • Independent scaling of the visual expert size could be tested as a way to improve efficiency further.
  • The method points toward building multimodal systems by layering task-specific experts onto existing large language models rather than training everything from scratch.

Load-bearing premise

The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.

What would settle it

A controlled experiment showing that inserting the visual expert either degrades accuracy on standard NLP benchmarks or fails to outperform simple input-space alignment methods on vision-language tasks would disprove the claim.
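The two failure conditions above can be written as an explicit decision rule. The following is a hedged sketch: the function name `claim_is_refuted`, the `tolerance` parameter, and the reading of "fails to outperform" as "no better on any benchmark" are assumptions made for illustration, and higher scores are assumed to be better.

```python
def claim_is_refuted(nlp_before, nlp_after, deep_vl, shallow_vl, tolerance=0.0):
    """Return True if either failure condition is met.

    nlp_before / nlp_after: benchmark name -> score of the base LM
        without / with the visual expert inserted.
    deep_vl / shallow_vl: benchmark name -> score of the visual-expert
        model / an input-space alignment baseline.
    """
    # Condition 1: some NLP benchmark drops by more than the tolerance.
    nlp_degraded = any(
        nlp_after[name] < nlp_before[name] - tolerance for name in nlp_before
    )
    # Condition 2 (assumed reading): the expert model beats the shallow
    # baseline on no vision-language benchmark.
    no_vl_gain = all(
        deep_vl[name] <= shallow_vl[name] + tolerance for name in deep_vl
    )
    return nlp_degraded or no_vl_gain
```

A nonzero `tolerance` would normally be set from seed-to-seed variance, so that noise alone does not count as degradation.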

Original abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CogVLM, a 17B-parameter visual-language model that inserts a trainable visual expert module into the attention and FFN layers of a frozen pretrained language model (e.g., Vicuna or LLaMA). This enables deep vision-language fusion while claiming to preserve the original NLP capabilities of the base model without architectural changes or loss-function modifications. The model reports state-of-the-art results on 10 cross-modal benchmarks (NoCaps, Flickr30k, RefCOCO variants, Visual7W, GQA, ScienceQA, VizWiz, TDIUC) and competitive performance on VQAv2, OKVQA, TextVQA, and COCO captioning, matching or exceeding PaLI-X 55B.

Significance. If the no-sacrifice claim on NLP tasks is substantiated, the approach would offer an efficient route to strong multimodal performance by leveraging existing frozen LMs, avoiding full retraining costs. The open-source release of code and checkpoints strengthens reproducibility and potential adoption.

major comments (2)
  1. [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.
  2. [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.
minor comments (2)
  1. [Abstract] The abstract lists 10 benchmarks but does not specify the exact evaluation protocols or splits used for each; this should be clarified for reproducibility.
  2. [Figures/Tables] Figure and table captions could more explicitly state whether results are zero-shot or fine-tuned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.

    Authors: We agree that explicit quantitative evidence is needed to substantiate the no-sacrifice claim. The design freezes the language model weights, which by construction preserves NLP behavior, but we did not report direct before/after numbers on standard NLP suites in the initial submission. In the revision we will add comparisons on MMLU and a Vicuna-style evaluation subset to directly demonstrate preservation of performance. revision: yes

  2. Referee: [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.

    Authors: Ablation studies on visual expert placement (attention vs. FFN layers and layer indices) are already present in Section 4.3 and the appendix. To address the concern we will elevate the key ablation tables into the main results section and add error bars (from 3 random seeds) plus representative training curves to the revised high-level results and supplementary material. revision: partial
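Error bars over a handful of seeds, as proposed in the second response, can be computed with the standard library. This is a minimal sketch: `summarize_seeds` and `gap_exceeds_noise` are hypothetical helper names, the comparison rule is a deliberately crude heuristic rather than a significance test, and the numbers are placeholders, not results from the paper.

```python
import statistics

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)

def gap_exceeds_noise(scores_a, scores_b):
    """Crude check: is the gap between means larger than the sum of the
    two sample standard deviations?"""
    mean_a, sd_a = summarize_seeds(scores_a)
    mean_b, sd_b = summarize_seeds(scores_b)
    return abs(mean_a - mean_b) > sd_a + sd_b

# Placeholder per-seed scores for illustration only.
mean, sd = summarize_seeds([1.0, 2.0, 3.0])
```

With only three seeds the standard deviation is itself a noisy estimate, so a ranking that flips under this check is a signal to run more seeds, not a verdict.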

Circularity Check

0 steps flagged

No circularity: empirical architecture claims rest on external benchmarks

Full rationale

The paper describes an architectural change (trainable visual expert inserted into frozen LM attention/FFN layers) and supports its claims via direct empirical results on 10+ cross-modal benchmarks, with comparisons to external models such as PaLI-X 55B. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs. The assertion of preserved NLP performance is presented as an empirical outcome rather than a self-referential prediction or self-citation chain. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes that a small trainable module inserted into frozen transformer layers can learn deep cross-modal alignment without any modification to the language model's pretraining objective or architecture.

axioms (1)
  • domain assumption A frozen pretrained language model can serve as a fixed backbone for vision-language tasks when augmented by an internal visual expert.
    Stated in the abstract as the core design choice.
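What "frozen backbone plus trainable expert" means for an optimizer step can be illustrated with a toy update rule. This is a sketch under assumptions: scalar parameters stand in for weight tensors, plain SGD stands in for whatever optimizer the authors used, and the parameter names (`lm.*`, `expert.*`) are hypothetical.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, changing only parameters marked trainable."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

# Toy scalar "weights"; real models hold tensors here.
params = {
    "lm.attn.qkv": 1.0,       # frozen language-model weight
    "lm.ffn": -0.5,           # frozen language-model weight
    "expert.attn.qkv": 1.0,   # trainable visual-expert weight
    "expert.ffn": -0.5,       # trainable visual-expert weight
}
grads = {name: 1.0 for name in params}
trainable = {"expert.attn.qkv", "expert.ffn"}

updated = sgd_step(params, grads, trainable)
```

After the step, every `lm.*` entry is bit-for-bit unchanged while the `expert.*` entries have moved, which is the property the axiom relies on.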

pith-pipeline@v0.9.0 · 5526 in / 1178 out tokens · 23582 ms · 2026-05-15T15:41:06.257046+00:00 · methodology



Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

    cs.CL 2024-10 unverdicted novelty 8.0

    ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

  2. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  3. Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

  4. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  5. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  6. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  7. From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models

    cs.CV 2025-05 unverdicted novelty 7.0

    Reformulates landmark visibility assessment as VLM-based detection on street view images and builds a heterogeneous visibility graph to map visual connections among landmarks and urban spaces.

  8. AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

    cs.CL 2025-03 unverdicted novelty 7.0

    AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach wi...

  9. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    cs.CV 2024-12 accept novelty 7.0

    OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

  10. HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

    cs.CV 2024-12 unverdicted novelty 7.0

    HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion p...

  11. Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

    cs.LG 2024-12 conditional novelty 7.0

    PAR fine-tunes CLIP to remove backdoors from structured triggers while preserving standard performance, and works even with only synthetic image-text pairs.

  12. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

    cs.IR 2024-10 conditional novelty 7.0

    VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

  13. LVBench: An Extreme Long Video Understanding Benchmark

    cs.CV 2024-06 accept novelty 7.0

    LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

  14. Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

    cs.CL 2024-05 unverdicted novelty 7.0

    Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.

  15. State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...

  16. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  17. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  18. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    cs.AI 2025-04 unverdicted novelty 6.0

    InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...

  19. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  20. When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

    cs.CV 2025-03 unverdicted novelty 6.0

    Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.

  21. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

    cs.CV 2025-01 unverdicted novelty 6.0

    MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

  22. VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    cs.CV 2024-12 unverdicted novelty 6.0

    VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% high...

  23. MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models

    cs.CV 2024-12 unverdicted novelty 6.0

    MM-MoralBench is a new multimodal benchmark that evaluates over 20 LVLMs on moral judgment, classification, and response tasks, finding pronounced divergence from human consensus and limited benefits from scaling or c...

  24. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  25. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL 2024-11 conditional novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  26. Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    cs.CL 2024-11 conditional novelty 6.0

    MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

  27. Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    cs.CV 2024-10 conditional novelty 6.0

    Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine...

  28. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    cs.CV 2024-10 unverdicted novelty 6.0

    LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...

  29. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    cs.CV 2024-10 accept novelty 6.0

    SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.

  30. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  31. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  32. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  33. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  34. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  35. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    cs.HC 2024-01 unverdicted novelty 6.0

    SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

  36. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  37. NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

    cs.CV 2025-10 unverdicted novelty 5.0

    NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal ...

  38. CogVLM2: Visual Language Models for Image and Video Understanding

    cs.CV 2024-08 conditional novelty 5.0

    CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

  39. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

    cs.CV 2024-08 unverdicted novelty 5.0

    mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

  40. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  41. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  42. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    cs.CV 2024-03 unverdicted novelty 5.0

    Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

  43. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  44. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  45. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 45 Pith papers · 18 internal anchors

  1. [1]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y ., Zhu, W., Marathe, K., Bitton, Y ., Gadre, S., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P. W., Ilharco, G., Worts- man, M., and Schmidt, L. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv preprint arXiv:2308.01390 ,

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A frontier large vision- language model with versatile abilities. arXiv preprint arXiv:2308.12966,

  3. [3]

    Murel: Multimodal relational reasoning for visual ques- tion answering

    Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. Murel: Multimodal relational reasoning for visual ques- tion answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pp. 1989–1998,

  4. [4]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023a. Chen, T., Li, L., Saxena, S., Hinton, G., and Fleet, D. J. A generalist framework for panoptic segmentation of images and videos. arXiv preprint arXiv:2210.06366 , 2022a. Chen, X., Wang, X....

  5. [5]

    Universal captioner: Long-tail vision-and-language model training through content-style separation.arXiv preprint arXiv:2111.12727,

    Cornia, M., Baraldi, L., Fiameni, G., and Cucchiara, R. Uni- versal captioner: Long-tail vision-and-language model training through content-style separation. arXiv preprint arXiv:2111.12727, 1(2):4,

  6. [6]

    Dreamllm: Synergistic multimodal comprehension and creation.arXiv preprint arXiv:2309.11499, 2023

    Dong, R., Han, C., Peng, Y ., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al. Dreamllm: Syn- ergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499,

  7. [7]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378,

  8. [8]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding. arXiv preprint arXiv:2009.03300,

  9. [9]

    and Johnson, M

    Honnibal, M. and Johnson, M. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 1373–1378,

  10. [10]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

  11. [11]

    and Kanan, C

    9 CogVLM: Visual Expert for Pretrained Language Models Kafle, K. and Kanan, C. An analysis of visual question answering algorithms. In Proceedings of the IEEE inter- national conference on computer vision , pp. 1965–1973,

  12. [12]

    Referitgame: Referring to objects in photographs of natu- ral scenes

    Kazemzadeh, S., Ordonez, V ., Matten, M., and Berg, T. Referitgame: Referring to objects in photographs of natu- ral scenes. In Proceedings of the 2014 conference on em- pirical methods in natural language processing (EMNLP), pp. 787–798,

  13. [13]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Li, B., Wang, R., Wang, G., Ge, Y ., Ge, Y ., and Shan, Y . Seed-bench: Benchmarking multimodal llms with gener- ative comprehension. arXiv preprint arXiv:2307.16125, 2023a. Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Boot- strapping language-image pre-training with frozen im- age encoders and large language models. arXiv preprint arXiv:2301.12597, ...

  14. [14]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023b. Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y ., and Wang, L. Aligning large multi-modal model with robust ins...

  15. [15]

    Prismer:

    Liu, S., Fan, L., Johns, E., Yu, Z., Xiao, C., and Anand- kumar, A. Prismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv:2303.02506, 2023d. Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. Grounding dino: Marry- ing dino with grounded pre-training for open-set object detection...

  16. [16]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255,

  17. [17]

    K., and Chakraborty, A

    Mishra, A., Shekhar, S., Singh, A. K., and Chakraborty, A. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pp. 947–952. IEEE,

  18. [18]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Peng, Z., Wang, W., Dong, L., Hao, Y ., Huang, S., Ma, S., and Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824. Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence...

  19. [19]

    GLU Variants Improve Transformer

    Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202,

  20. [20]

    Textcaps: a dataset for image captioning with reading comprehension

    Sidorov, O., Hu, R., Rohrbach, M., and Singh, A. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 742–758. Springer,

  21. [21]

    Generative multimodal models are in-context learners

    Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Luo, Z., Wang, Y., Rao, Y., Liu, J., Huang, T., et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023a.
    Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023b.
    Touvron,...

  22. [22]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., and Wang, L. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022a.
    Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. Ofa: Unifying architectures, tasks, and modalities through a sim...

  23. [23]

    mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

    Ye, Q., Xu, H., Ye, J., Yan, M., Liu, H., Qian, Q., Zhang, J., Huang, F., and Zhou, J. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257,

  24. [24]

    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.-F., and Yang, Y. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704,

  25. [25]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917,

  26. [26]

    Modeling Context in Referring Expressions

    Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 69–85. Springer,

  27. [27]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490,

  28. [28]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502,

  29. [29]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592,

  30. [30]

    The captions in the COCO dataset were collected from Amazon Mechanical Turk (AMT) workers, who were given instructions to control quality. The dataset contains 330K images; the train, validation, and test sets contain 413,915 captions for 82,783 images, 202,520 captions for 40,504 images, and 379,249 captions for 40,775 images, respectively. • NoCaps ...

  31. [31]

    Summary of the evaluation benchmarks.

    Task | Dataset | Description | Split | Metrics
    Image Caption | NoCaps | Captioning of natural images. | val | CIDEr (↑)
    Image Caption | Flickr | Captioning of natural images. | karpathy-test | CIDEr (↑)
    Image Caption | COCO | Captioning of natural images. | karpathy-test | CIDEr (↑)
    Image Caption | TextCaps | Captioning of natural images containing text. | test | CIDEr (↑)
    General VQA | VQAv2 | VQA...

  32. [32]

    which”-type question, such as “Which is the small computer in the corner?

    The RefCOCOg subset was collected through Amazon Mechanical Turk, where workers wrote natural referring expressions for objects in MSCOCO images; it contains 85,474 referring expressions spanning 26,711 images, each containing 2 to 4 objects of the same category. • Visual7W (Zhu et al., 2016). The Visual7W dataset is predominantly designed for VQA tasks, wit...

  33. [33]

    Performance on the TDIUC benchmark with fine-grained question classes.

    B. Additional Fine-grained Experiments

    To comprehensively investigate the proposed model on specific topics and question types, we further conduct extensive experiments on a representative benchmark, TDIUC (Kafle & Kanan, 2017). We use the publicly available split of the val set as evaluation...