Recognition: 2 theorem links · Lean Theorem
CogVLM: Visual Expert for Pretrained Language Models
Pith reviewed 2026-05-15 15:41 UTC · model grok-4.3
The pith
A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CogVLM shows that inserting a trainable visual expert module into the attention and FFN layers of a frozen pretrained language model bridges the gap between vision and language representations. Because the original weights stay frozen, this yields deep feature fusion across modalities without sacrificing the language model's original NLP capabilities. The 17B CogVLM model delivers state-of-the-art performance on ten classic cross-modal benchmarks, including NoCaps, Flickr30k, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA and COCO captioning, surpassing or matching the 55B PaLI-X model.
What carries the argument
The visual expert module, a trainable component inserted into the attention and FFN layers of the frozen language model to enable deep vision-language feature fusion.
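To make the carrier concrete, here is a minimal sketch, assuming a PyTorch-style decoder layer; the class name VisualExpertAttention, the qkv_text/qkv_image projections, and the image_mask routing are illustrative choices, not the released CogVLM code. Text positions keep the original frozen projections, image positions use a parallel trainable copy, and one shared attention operation fuses both modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualExpertAttention(nn.Module):
    """Illustrative sketch of a visual-expert attention layer.

    Text tokens go through the frozen original projections; image tokens go
    through a parallel trainable copy. Names and shapes are assumptions for
    exposition, not the authors' implementation.
    """

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads

        # Original language-model projections, kept frozen.
        self.qkv_text = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_text = nn.Linear(hidden_size, hidden_size)
        for p in list(self.qkv_text.parameters()) + list(self.out_text.parameters()):
            p.requires_grad = False

        # Trainable visual expert: same shapes, applied only at image positions.
        self.qkv_image = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_image = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); image_mask: (batch, seq) bool, True at image positions.
        b, s, h = x.shape
        mask = image_mask.unsqueeze(-1)

        # Per-position routing: image tokens use expert weights, text tokens use frozen weights.
        qkv = torch.where(mask, self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)

        def heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        # Attention itself is shared across modalities, so image and text features mix deeply.
        ctx = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        ctx = ctx.transpose(1, 2).reshape(b, s, h)

        return torch.where(mask, self.out_image(ctx), self.out_text(ctx))
```

Under the same reading, the FFN layers would get the equivalent treatment: a second trainable MLP selected by the same per-position mask, while the original MLP weights stay frozen.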
If this is right
- The 17B model achieves state-of-the-art results on ten cross-modal benchmarks including NoCaps, Flickr30k captioning, RefCOCO series, GQA, ScienceQA and VizWiz VQA.
- It ranks second on VQAv2, OKVQA, TextVQA and COCO captioning while matching or exceeding the 55B PaLI-X model.
- Full performance on pure NLP tasks is retained with no architectural changes to the base language model.
- Code and checkpoints are released as open source for further use.
Where Pith is reading between the lines
- The insertion technique may apply to other frozen pretrained models beyond the specific base used in the experiments.
- Independent scaling of the visual expert size could be tested as a way to improve efficiency further.
- The method points toward building multimodal systems by layering task-specific experts onto existing large language models rather than training everything from scratch.
Load-bearing premise
The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.
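One way to see why the premise is plausible: if the expert parameters are only ever selected at image positions, a text-only sequence never touches them, so the augmented layer reproduces the frozen original exactly. A minimal check, reusing the VisualExpertAttention sketch shown earlier (illustrative code, not the paper's implementation):

```python
import torch
import torch.nn.functional as F


def frozen_only(layer, x):
    """Reference path that uses only the original frozen projections."""
    b, s, h = x.shape
    q, k, v = layer.qkv_text(x).chunk(3, dim=-1)

    def heads(t):
        return t.view(b, s, layer.num_heads, layer.head_dim).transpose(1, 2)

    ctx = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
    return layer.out_text(ctx.transpose(1, 2).reshape(b, s, h))


torch.manual_seed(0)
layer = VisualExpertAttention(hidden_size=64, num_heads=4)  # sketch class defined above
x = torch.randn(2, 16, 64)                                  # a text-only batch
no_images = torch.zeros(2, 16, dtype=torch.bool)            # no image positions at all

# With no image tokens, the routed layer matches the frozen original exactly.
assert torch.allclose(layer(x, no_images), frozen_only(layer, x), atol=1e-6)
```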
What would settle it
A controlled experiment showing that inserting the visual expert either degrades accuracy on standard NLP benchmarks or fails to outperform simple input-space alignment methods on vision-language tasks would disprove the claim.
read the original abstract
We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CogVLM, a 17B-parameter visual-language model that inserts a trainable visual expert module into the attention and FFN layers of a frozen pretrained language model (e.g., Vicuna or LLaMA). This enables deep vision-language fusion while claiming to preserve the original NLP capabilities of the base model without architectural changes or loss-function modifications. The model reports state-of-the-art results on 10 cross-modal benchmarks (NoCaps, Flickr30k, RefCOCO variants, Visual7W, GQA, ScienceQA, VizWiz, TDIUC) and competitive performance on VQAv2, OKVQA, TextVQA, and COCO captioning, matching or exceeding PaLI-X 55B.
Significance. If the no-sacrifice claim on NLP tasks is substantiated, the approach would offer an efficient route to strong multimodal performance by leveraging existing frozen LMs, avoiding full retraining costs. The open-source release of code and checkpoints strengthens reproducibility and potential adoption.
major comments (2)
- [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.
- [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.
minor comments (2)
- [Abstract] The abstract lists 10 benchmarks but does not specify the exact evaluation protocols or splits used for each; this should be clarified for reproducibility.
- [Figures/Tables] Figure and table captions could more explicitly state whether results are zero-shot or fine-tuned.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and results] Abstract and results section: The central claim that insertion of the visual expert 'does not sacrifice any performance on NLP tasks' is load-bearing for the method's value proposition, yet the manuscript provides no quantitative before/after comparison on the base LM's original NLP benchmarks (e.g., MMLU, HumanEval, or Vicuna test suite). Only VLM benchmark numbers are reported.
Authors: We agree that explicit quantitative evidence is needed to substantiate the no-sacrifice claim. The design freezes the language model weights, which by construction preserves NLP behavior, but we did not report direct before/after numbers on standard NLP suites in the initial submission. In the revision we will add comparisons on MMLU and a Vicuna-style evaluation subset to directly demonstrate preservation of performance. revision: yes
-
Referee: [Experiments] Experimental setup: No error bars, ablation studies on the visual expert placement, or training curves are mentioned in the abstract or high-level results, making it impossible to assess whether post-hoc protocol choices affect the reported SOTA rankings.
Authors: Ablation studies on visual expert placement (attention vs. FFN layers and layer indices) are already present in Section 4.3 and the appendix. To address the concern we will elevate the key ablation tables into the main results section and add error bars (from 3 random seeds) plus representative training curves to the revised high-level results and supplementary material. revision: partial
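As a rough illustration of what "error bars from 3 random seeds" would mean in the revised tables, here is a minimal sketch (generic benchmark names and placeholder scores, not CogVLM's reported numbers) of aggregating per-seed results into mean ± standard deviation:

```python
from statistics import mean, stdev

# Hypothetical per-seed scores for two generic benchmarks (placeholders only).
scores_by_seed = {
    "benchmark_a": [71.2, 71.5, 71.1],
    "benchmark_b": [64.8, 65.3, 65.0],
}

for name, scores in scores_by_seed.items():
    print(f"{name}: {mean(scores):.2f} ± {stdev(scores):.2f} over {len(scores)} seeds")
```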
Circularity Check
No circularity: empirical architecture claims rest on external benchmarks
full rationale
The paper describes an architectural change (trainable visual expert inserted into frozen LM attention/FFN layers) and supports its claims via direct empirical results on 10+ cross-modal benchmarks, with comparisons to external models such as PaLI-X 55B. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs. The assertion of preserved NLP performance is presented as an empirical outcome rather than a self-referential prediction or self-citation chain. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A frozen pretrained language model can serve as a fixed backbone for vision-language tasks when augmented by an internal visual expert.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
unclear: relation between the paper passage and the cited Recognition theorem.
CogVLM bridges the gap ... by a trainable visual expert module in the attention and FFN layers ... without sacrificing any performance on NLP tasks.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
unclear: relation between the paper passage and the cited Recognition theorem.
the original language model are fixed, the behaviors are the same as in the original language model if the input sequence contains no image
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
-
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine...
-
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Reference graph
Works this paper leans on
-
[1]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P. W., Ilharco, G., Wortsman, M., and Schmidt, L. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
-
[3]
MuRel: Multimodal relational reasoning for visual question answering
Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. MuRel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1989–1998.
-
[4]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
-
[5]
Universal captioner: Long-tail vision-and-language model training through content-style separation
Cornia, M., Baraldi, L., Fiameni, G., and Cucchiara, R. Universal captioner: Long-tail vision-and-language model training through content-style separation. arXiv preprint arXiv:2111.12727.
-
[6]
DreamLLM: Synergistic multimodal comprehension and creation
Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al. DreamLLM: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499.
-
[7]
PaLM-E: An Embodied Multimodal Language Model
Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
-
[8]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
-
[9]
An improved non-monotonic transition system for dependency parsing
Honnibal, M. and Johnson, M. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1373–1378.
-
[10]
LoRA: Low-Rank Adaptation of Large Language Models
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
-
[11]
An analysis of visual question answering algorithms
Kafle, K. and Kanan, C. An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1965–1973.
-
[12]
ReferItGame: Referring to objects in photographs of natural scenes
Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798.
-
[13]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
-
[14]
Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models
Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
-
[15]
Prismer: A vision-language model with an ensemble of experts
Liu, S., Fan, L., Johns, E., Yu, Z., Xiao, C., and Anandkumar, A. Prismer: A vision-language model with an ensemble of experts. arXiv preprint arXiv:2303.02506, 2023.
-
[16]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
-
[17]
OCR-VQA: Visual question answering by reading text in images
Mishra, A., Shekhar, S., Singh, A. K., and Chakraborty, A. OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 947–952. IEEE.
-
[18]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., and Wei, F. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
-
[19]
GLU Variants Improve Transformer
Shazeer, N. GLU variants improve Transformer. arXiv preprint arXiv:2002.05202.
-
[20]
TextCaps: A dataset for image captioning with reading comprehension
Sidorov, O., Hu, R., Rohrbach, M., and Singh, A. TextCaps: A dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pp. 742–758. Springer.
-
[21]
Generative multimodal models are in-context learners
Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Luo, Z., Wang, Y., Rao, Y., Liu, J., Huang, T., et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023.
-
[22]
GIT: A generative image-to-text transformer for vision and language
Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., and Wang, L. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
-
[23]
mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration
Ye, Q., Xu, H., Ye, J., Yan, M., Liu, H., Qian, Q., Zhang, J., Huang, F., and Zhou, J. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257.
-
[24]
Ferret: Refer and Ground Anything Anywhere at Any Granularity
You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.-F., and Yang, Y. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704.
-
[25]
CoCa: Contrastive Captioners are Image-Text Foundation Models
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917.
-
[26]
Modeling context in referring expressions
Yu, L., Poirson, P., Yang, S., Berg, A. C., and Berg, T. L. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pp. 69–85. Springer.
-
[27]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
-
[28]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502.
-
[29]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
discussion (0)