arxiv: 2311.03079 · v2 · submitted 2023-11-06 · 💻 cs.CV

Recognition: 2 theorem links


CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang


Pith reviewed 2026-05-15 15:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords: CogVLM, visual expert module, deep fusion, vision language model, frozen language model, multimodal benchmarks, cross-modal tasks, pretrained models


The pith

A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CogVLM adds a trainable visual expert module directly into the attention and feed-forward layers of a frozen pretrained language model. Unlike shallow alignment methods, which only map image features into the language model's input space, this design lets vision and language features interact throughout the network. The resulting 17-billion-parameter model reaches state-of-the-art results on ten cross-modal benchmarks such as NoCaps, Flickr30k captioning, RefCOCO series, GQA, ScienceQA, and VizWiz, and the paper claims it does so without sacrificing performance on pure language tasks. It matches or exceeds the results of much larger models like PaLI-X 55B on these tasks. The approach requires no changes to the original language model architecture or loss functions.

Core claim

CogVLM shows that inserting a trainable visual expert module into the attention and FFN layers of any frozen pretrained language model bridges the gap between vision and language representations. This produces deep feature fusion across modalities without sacrificing the language model's original capabilities or requiring architectural modifications. The 17B CogVLM model then delivers state-of-the-art performance on ten classic cross-modal benchmarks including NoCaps, Flickr30k, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA and COCO captioning, matching or surpassing the 55B PaLI-X model.

What carries the argument

The visual expert module, a trainable component inserted into the attention and FFN layers of the frozen language model to enable deep vision-language feature fusion.
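The routing idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the expert is applied per token according to a modality flag, shows a single projection rather than full attention and FFN blocks, and uses made-up 2x2 weights. The names `route_projection`, `W_TEXT`, and `W_IMAGE` are hypothetical.

```python
def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route_projection(hidden, is_image, w_text, w_image):
    """Project each token with the frozen text weights or the trainable
    image weights, depending on its modality flag (assumed routing rule)."""
    return [
        matvec(w_image if img else w_text, h)
        for h, img in zip(hidden, is_image)
    ]


# Frozen language-model projection (identity here, purely for illustration).
W_TEXT = [[1.0, 0.0], [0.0, 1.0]]
# Trainable visual-expert projection (arbitrary placeholder values).
W_IMAGE = [[0.5, 0.5], [-0.5, 0.5]]

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Text-only sequence: the expert weights are never used.
text_only = route_projection(hidden, [False, False, False], W_TEXT, W_IMAGE)
frozen_baseline = [matvec(W_TEXT, h) for h in hidden]

# Mixed sequence: the first two positions are image tokens.
mixed = route_projection(hidden, [True, True, False], W_TEXT, W_IMAGE)
```

With no image tokens in the sequence, the routed output equals the frozen projection, which is the mechanism behind the claim that behavior on image-free inputs is unchanged.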

If this is right

The 17B model achieves state-of-the-art results on ten cross-modal benchmarks including NoCaps, Flickr30k captioning, RefCOCO series, GQA, ScienceQA and VizWiz VQA.
It ranks second on VQAv2, OKVQA, TextVQA and COCO captioning while matching or exceeding the 55B PaLI-X model.
Full performance on pure NLP tasks is retained with no architectural changes to the base language model.
Code and checkpoints are released as open source.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The insertion technique may apply to other frozen pretrained models beyond the specific base used in the experiments.
Independent scaling of the visual expert size could be tested as a way to improve efficiency further.
The method points toward building multimodal systems by layering task-specific experts onto existing large language models rather than training everything from scratch.

Load-bearing premise

The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.

What would settle it

A controlled experiment showing that inserting the visual expert either degrades accuracy on standard NLP benchmarks or fails to outperform simple input-space alignment methods on vision-language tasks would disprove the claim.
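The two falsifying conditions can be written as a small check. This is a sketch under stated assumptions: the function name `falsifying_conditions`, the tolerance parameter, and all benchmark names and scores are placeholders for illustration, not results from the paper.

```python
def falsifying_conditions(base_nlp, expert_nlp, shallow_vl, expert_vl, tol=0.0):
    """Report which benchmarks trigger either falsifying condition.

    base_nlp / expert_nlp: NLP scores before and after adding the expert.
    shallow_vl / expert_vl: vision-language scores for an input-space
    alignment baseline and for the visual expert model.
    tol: allowed NLP score drop before it counts as degradation.
    """
    nlp_degraded = [
        name for name, score in base_nlp.items()
        if expert_nlp[name] < score - tol
    ]
    not_better = [
        name for name, score in shallow_vl.items()
        if expert_vl[name] <= score
    ]
    return {"nlp_degraded": nlp_degraded, "not_better_than_shallow": not_better}


# Placeholder scores only; none of these numbers come from the paper.
result = falsifying_conditions(
    base_nlp={"bench_a": 50.0, "bench_b": 60.0},
    expert_nlp={"bench_a": 50.0, "bench_b": 58.5},
    shallow_vl={"task_x": 70.0},
    expert_vl={"task_x": 72.0},
    tol=1.0,
)
```

An empty list under both keys would leave the claim standing; any entry in either list is evidence against it for that benchmark.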

Original abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CogVLM, a 17B-parameter visual-language model that inserts a trainable visual expert module into the attention and FFN layers of a frozen pretrained language model (e.g., Vicuna or LLaMA). This enables deep vision-language fusion while claiming to preserve the original NLP capabilities of the base model without architectural changes or loss-function modifications. The model reports state-of-the-art results on 10 cross-modal benchmarks (NoCaps, Flickr30k, RefCOCO variants, Visual7W, GQA, ScienceQA, VizWiz, TDIUC) and competitive performance on VQAv2, OKVQA, TextVQA, and COCO captioning, matching or exceeding PaLI-X 55B.

Significance. If the no-sacrifice claim on NLP tasks is substantiated, the approach would offer an efficient route to strong multimodal performance by leveraging existing frozen LMs, avoiding full retraining costs. The open-source release of code and checkpoints strengthens reproducibility and potential adoption.

major comments (2)

[Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.
[Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.

minor comments (2)

[Abstract] The abstract lists 10 benchmarks but does not specify the exact evaluation protocols or splits used for each; this should be clarified for reproducibility.
[Figures/Tables] Figure and table captions could more explicitly state whether results are zero-shot or fine-tuned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.


Referee: [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.

Authors: We agree that explicit quantitative evidence is needed to substantiate the no-sacrifice claim. The design freezes the language model weights, so for inputs that contain no image the model behaves as the original language model by construction, but we did not report direct before/after numbers on standard NLP suites in the initial submission. In the revision we will add comparisons on MMLU and a Vicuna-style evaluation subset to directly demonstrate preservation of performance.
revision: yes

Referee: [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.

Authors: Ablation studies on visual expert placement (attention vs. FFN layers and layer indices) are already present in Section 4.3 and the appendix. To address the concern we will elevate the key ablation tables into the main results section and add error bars (from 3 random seeds) plus representative training curves to the revised high-level results and supplementary material.
revision: partial
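Error bars over a handful of seeds are usually reported as a mean plus a sample standard deviation. The sketch below shows one way to compute them with the Python standard library; the helper name `summarize_seeds` and the three scores are placeholders, not numbers from the paper or the rebuttal.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    mean = statistics.mean(scores)
    # stdev uses the n-1 denominator, the usual choice for a few seeds.
    std = statistics.stdev(scores)
    return mean, std


# Placeholder scores for three seeds; not real benchmark results.
seed_scores = [80.0, 82.0, 84.0]
mean, std = summarize_seeds(seed_scores)
print(f"{mean:.1f} +/- {std:.1f}")
```

With only three seeds the standard deviation is itself a noisy estimate, so ranking differences smaller than the reported spread should be read cautiously.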

Circularity Check

0 steps flagged

No circularity: empirical architecture claims rest on external benchmarks


The paper describes an architectural change (trainable visual expert inserted into frozen LM attention/FFN layers) and supports its claims via direct empirical results on 10+ cross-modal benchmarks, with comparisons to external models such as PaLI-X 55B. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs. The assertion of preserved NLP performance is presented as an empirical outcome rather than a self-referential prediction or self-citation chain. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes that a small trainable module inserted into frozen transformer layers can learn deep cross-modal alignment without any modification to the language model's pretraining objective or architecture.

axioms (1)

domain assumption A frozen pretrained language model can serve as a fixed backbone for vision-language tasks when augmented by an internal visual expert.
Stated in the abstract as the core design choice.

pith-pipeline@v0.9.0 · 5526 in / 1178 out tokens · 23582 ms · 2026-05-15T15:41:06.257046+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"CogVLM bridges the gap ... by a trainable visual expert module in the attention and FFN layers ... without sacrificing any performance on NLP tasks."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"the original language model are fixed, the behaviors are the same as in the original language model if the input sequence contains no image"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 7.0

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
cs.CL 2026-04 unverdicted novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
cs.CV 2026-04 unverdicted novelty 6.0

MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
cs.CV 2024-10 conditional novelty 6.0

Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine...
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
cs.CV 2024-10 accept novelty 6.0

SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
cs.CV 2024-08 unverdicted novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
cs.CV 2024-03 conditional novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 6.0

DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV 2023-07 accept novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
