pith. machine review for the scientific record.

arxiv: 2311.03079 · v2 · submitted 2023-11-06 · 💻 cs.CV

CogVLM: Visual Expert for Pretrained Language Models

Pith reviewed 2026-05-15 15:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords CogVLM · visual expert module · deep fusion · vision language model · frozen language model · multimodal benchmarks · cross-modal tasks · pretrained models

The pith

A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CogVLM adds a trainable visual expert module directly into the attention and feed-forward layers of a frozen pretrained language model. Shallow alignment methods only map image features into the language model's input space; the visual expert instead lets vision and language features interact throughout the network. The resulting 17-billion-parameter model reaches state-of-the-art results on ten cross-modal benchmarks, including NoCaps, Flickr30k captioning, the RefCOCO series, GQA, ScienceQA, and VizWiz, while preserving full performance on pure language tasks. It matches or exceeds much larger models such as PaLI-X 55B on these tasks. The approach requires no changes to the original language model architecture or loss functions.

Core claim

CogVLM shows that inserting a trainable visual expert module into the attention and FFN layers of any frozen pretrained language model bridges the gap between vision and language representations. This produces deep feature fusion across modalities without sacrificing the language model's original capabilities or requiring architectural modifications. The 17B CogVLM model then delivers state-of-the-art performance on ten classic cross-modal benchmarks including NoCaps, Flickr30k, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA and COCO captioning, matching or surpassing the 55B PaLI-X model.

What carries the argument

The visual expert module, a trainable component inserted into the attention and FFN layers of the frozen language model to enable deep vision-language feature fusion.
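
A minimal sketch of how such routing could look. It assumes (this page does not spell it out) that image tokens pass through trainable projection weights, text tokens keep the frozen language-model weights, and attention then runs causally over the joint sequence. The names `routed_linear`, `routed_attention`, `text_qkv` and `image_qkv` are hypothetical, the toy uses plain Python lists and a single head, and it is not the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def routed_linear(tokens, is_image, W_text, W_image):
    """Use W_image for image tokens and the frozen W_text for text tokens."""
    return [matvec(W_image if img else W_text, t)
            for t, img in zip(tokens, is_image)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def routed_attention(tokens, is_image, text_qkv, image_qkv):
    """Single-head causal attention over the joint image+text sequence.

    text_qkv and image_qkv are (W_q, W_k, W_v) triples. In this sketch only
    image_qkv would be trained; text_qkv stands in for the frozen weights.
    """
    q, k, v = (routed_linear(tokens, is_image, wt, wi)
               for wt, wi in zip(text_qkv, image_qkv))
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Assumed causal mask: position i attends to positions 0..i,
        # whatever their modality.
        weights = softmax([scale * sum(a * b for a, b in zip(qi, k[j]))
                           for j in range(i + 1)])
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Toy weights chosen only to make the two paths distinguishable.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_qkv = (identity, identity, identity)
image_qkv = (doubled, doubled, doubled)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mixed = routed_attention(tokens, [True, True, False], text_qkv, image_qkv)
text_only = routed_attention(tokens, [False, False, False], text_qkv, image_qkv)
frozen = routed_attention(tokens, [False, False, False], text_qkv, text_qkv)
print(text_only == frozen)  # True: no image tokens, so the expert is never used
```

With an all-text mask the routed layer reduces to the frozen layer, which is the property the no-sacrifice claim depends on. A feed-forward layer could be routed by the same per-token switch.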

If this is right

  • The 17B model achieves state-of-the-art results on ten cross-modal benchmarks including NoCaps, Flickr30k captioning, RefCOCO series, GQA, ScienceQA and VizWiz VQA.
  • It ranks second on VQAv2, OKVQA, TextVQA and COCO captioning while matching or exceeding the 55B PaLI-X model.
  • Full performance on pure NLP tasks is retained with no architectural changes to the base language model.
  • Codes and checkpoints are released as open source for further use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The insertion technique may apply to other frozen pretrained models beyond the specific base used in the experiments.
  • Independent scaling of the visual expert size could be tested as a way to improve efficiency further.
  • The method points toward building multimodal systems by layering task-specific experts onto existing large language models rather than training everything from scratch.

Load-bearing premise

The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.

What would settle it

A controlled experiment showing that inserting the visual expert either degrades accuracy on standard NLP benchmarks or fails to outperform simple input-space alignment methods on vision-language tasks would disprove the claim.
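
The first half of that test can be sketched as a paired comparison between the base language model and the augmented model on the same NLP benchmarks. The helper name `find_regressions`, the benchmark keys and the scores are placeholders invented for illustration; none of them come from the paper.

```python
def find_regressions(base_scores, expert_scores, tolerance=0.0):
    """Return {benchmark: score change} for benchmarks where the augmented
    model falls below the base model by more than `tolerance` points."""
    shared = sorted(set(base_scores) & set(expert_scores))
    return {
        name: round(expert_scores[name] - base_scores[name], 6)
        for name in shared
        if base_scores[name] - expert_scores[name] > tolerance
    }

# Placeholder numbers for illustration only; not results from the paper.
base = {"nlp_task_a": 50.0, "nlp_task_b": 40.0}
augmented = {"nlp_task_a": 50.0, "nlp_task_b": 38.5}
print(find_regressions(base, augmented, tolerance=0.5))
```

An empty result on the base model's own NLP suites would be consistent with the claim; any entry beyond the chosen tolerance would count against it.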

read the original abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CogVLM, a 17B-parameter visual-language model that inserts a trainable visual expert module into the attention and FFN layers of a frozen pretrained language model (e.g., Vicuna or LLaMA). This enables deep vision-language fusion while claiming to preserve the original NLP capabilities of the base model without architectural changes or loss-function modifications. The model reports state-of-the-art results on 10 cross-modal benchmarks (NoCaps, Flickr30k, RefCOCO variants, Visual7W, GQA, ScienceQA, VizWiz, TDIUC) and competitive performance on VQAv2, OKVQA, TextVQA, and COCO captioning, matching or exceeding PaLI-X 55B.

Significance. If the no-sacrifice claim on NLP tasks is substantiated, the approach would offer an efficient route to strong multimodal performance by leveraging existing frozen LMs, avoiding full retraining costs. The open-source release of code and checkpoints strengthens reproducibility and potential adoption.

major comments (2)
  1. [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.
  2. [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.
minor comments (2)
  1. [Abstract] The abstract lists 10 benchmarks but does not specify the exact evaluation protocols or splits used for each; this should be clarified for reproducibility.
  2. [Figures/Tables] Figure and table captions could more explicitly state whether results are zero-shot or fine-tuned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.

    Authors: We agree that explicit quantitative evidence is needed to substantiate the no-sacrifice claim. The design freezes the language model weights, which by construction preserves NLP behavior, but we did not report direct before/after numbers on standard NLP suites in the initial submission. In the revision we will add comparisons on MMLU and a Vicuna-style evaluation subset to directly demonstrate preservation of performance. revision: yes

  2. Referee: [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.

    Authors: Ablation studies on visual expert placement (attention vs. FFN layers and layer indices) are already present in Section 4.3 and the appendix. To address the concern we will elevate the key ablation tables into the main results section and add error bars (from 3 random seeds) plus representative training curves to the revised high-level results and supplementary material. revision: partial
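
The error bars mentioned in the second response could be computed along these lines: a mean and sample standard deviation per metric across seeds. The helper name `summarize_seeds` and the three scores are placeholders for illustration, not values from the paper.

```python
import statistics

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder scores for three hypothetical seeds; not results from the paper.
seed_scores = [80.0, 81.0, 82.0]
mean, std = summarize_seeds(seed_scores)
print(f"{mean:.1f} ± {std:.1f}")
```

With only three seeds the spread estimate is itself noisy, so a ranking gap smaller than the reported deviation would be hard to interpret.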

Circularity Check

0 steps flagged

No circularity: empirical architecture claims rest on external benchmarks

full rationale

The paper describes an architectural change (trainable visual expert inserted into frozen LM attention/FFN layers) and supports its claims via direct empirical results on 10+ cross-modal benchmarks, with comparisons to external models such as PaLI-X 55B. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs. The assertion of preserved NLP performance is presented as an empirical outcome rather than a self-referential prediction or self-citation chain. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes that a small trainable module inserted into frozen transformer layers can learn deep cross-modal alignment without any modification to the language model's pretraining objective or architecture.

axioms (1)
  • domain assumption: A frozen pretrained language model can serve as a fixed backbone for vision-language tasks when augmented by an internal visual expert.
    Stated in the abstract as the core design choice.

pith-pipeline@v0.9.0 · 5526 in / 1178 out tokens · 23582 ms · 2026-05-15T15:41:06.257046+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  3. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  4. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  5. State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...

  6. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  7. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  8. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  9. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  10. Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    cs.CV 2024-10 conditional novelty 6.0

    Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine...

  11. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    cs.CV 2024-10 accept novelty 6.0

    SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.

  12. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  13. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  14. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  15. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  16. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  17. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  18. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  19. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  20. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
