pith. machine review for the scientific record.

arxiv: 2310.09478 · v3 · submitted 2023-10-14 · 💻 cs.CV

Recognition: 2 theorem links


MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:08 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: MiniGPT-v2 · vision-language models · multi-task learning · visual question answering · visual grounding · large language models · unified interface

The pith

MiniGPT-v2 uses unique task identifiers to let one large language model handle many vision-language tasks at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a single model that can describe images, answer questions about them, locate objects, and handle related tasks from simple instructions. It does so by inserting special identifiers that label each task during training, which helps the model keep instructions for different tasks distinct and learn the tasks side by side. The authors train the model in three stages and then test it on standard visual question-answering and visual grounding benchmarks. If the identifiers work as intended, the same model parameters can serve as a general interface instead of requiring a separate model for each vision-language job.

Core claim

MiniGPT-v2 treats a large language model as a unified interface for vision-language tasks by assigning a unique identifier to each task type during training. These identifiers allow the model to distinguish task instructions effortlessly and improve learning efficiency across tasks. After three-stage training, the model records strong results on multiple visual question-answering and visual grounding benchmarks relative to other vision-language generalist models.

What carries the argument

Unique task identifiers assigned during training that mark each vision-language task so the model can separate instructions and avoid interference.
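
To make the mechanism concrete, here is a minimal sketch of identifier-tagged templating in Python. The template shape and the bracketed tag strings are illustrative assumptions drawn from the abstract's description, not text quoted from the paper.

```python
# Minimal sketch of task-identifier templating (assumed scheme, not the
# paper's exact format): each training example carries an explicit tag
# naming the vision-language task its instruction belongs to.

TEMPLATE = "[INST] <Img>{image}</Img> {task_id} {instruction} [/INST]"

def make_prompt(task_id: str, instruction: str, image: str = "<ImageHere>") -> str:
    """Render one multi-modal example; the tag is the only task signal."""
    return TEMPLATE.format(image=image, task_id=task_id, instruction=instruction)

print(make_prompt("[vqa]", "What is the man holding?"))
print(make_prompt("[grounding]", "Locate the red car in the image."))
```

Because otherwise-similar instructions differ only in this tag, a single set of weights can route them to task-appropriate behavior without separate heads.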

If this is right

  • A single set of model weights can perform image description, visual question answering, and visual grounding.
  • Task interference is reduced because each instruction carries an explicit label.
  • The same training recipe can be applied to additional vision-language tasks without building new models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identifier approach could be tested on video or audio tasks to see if one model can handle still more modalities.
  • Removing the identifiers after training might reveal how much the tags contribute to final performance.
  • This style of tagging could simplify scaling to new tasks by adding only a new identifier rather than retraining from scratch.

Load-bearing premise

Giving each task a unique identifier lets the model tell the tasks apart and learn them efficiently without one task hurting another.

What would settle it

Train an otherwise identical model without the task identifiers and check whether accuracy falls on individual benchmarks such as visual question answering or grounding.
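
A hedged sketch of that experiment, where `train` and `evaluate` are placeholders for a real fine-tuning and benchmark pipeline and the corpus is assumed to hold (task_tag, instruction, target) triples; none of these names come from the paper.

```python
# Hypothetical ablation harness: two runs identical except for the tags.

def run_ablation(train, evaluate, corpus, benchmarks):
    """Train twin models that differ only in whether the task tag is kept."""
    results = {}
    for use_ids in (True, False):
        data = [
            (f"{tag} {instruction}" if use_ids else instruction, target)
            for tag, instruction, target in corpus
        ]
        model = train(data)  # same base LLM, data mixture, and three stages
        key = "with_ids" if use_ids else "without_ids"
        results[key] = {b: evaluate(model, b) for b in benchmarks}
    return results  # e.g. benchmarks = ["VQAv2", "OK-VQA", "RefCOCO"]
```

A consistent per-benchmark drop in the without_ids arm would attribute the gains to the identifiers rather than to data scale or the base model.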

read the original abstract

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniGPT-v2, a vision-language model positioned as a unified interface for multiple tasks including image description, visual question answering (VQA), and visual grounding. It proposes the use of unique task identifier tokens during a three-stage training procedure to help the model distinguish instructions and improve per-task learning efficiency, claiming that the resulting model achieves strong benchmark performance relative to other vision-language generalist models.

Significance. If the performance claims are substantiated, the work would provide a lightweight mechanism for multi-task VL learning that avoids complex task-specific heads or routing, potentially simplifying the construction of generalist models. The public release of model weights and code would further support reproducibility and follow-up research.

major comments (2)
  1. [Abstract and §3 (method)] The central claim that unique task identifiers 'enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task' is presented without any ablation that isolates their contribution. No comparison to an otherwise identical model trained without identifiers is reported, leaving open the possibility that observed gains arise from data scale, the base LLM, or other unablated factors rather than the identifier mechanism.
  2. [§4 (experiments)] The abstract asserts 'strong performance on many visual question-answering and visual grounding benchmarks' after three-stage training, yet the provided text supplies no quantitative numbers, baseline tables, or statistical significance tests. Without these details it is impossible to judge the magnitude or reliability of the claimed improvements over prior generalist models.
minor comments (2)
  1. [§3] The three training stages are referenced but their exact data mixtures, learning-rate schedules, and loss weighting are not fully specified in the abstract; a concise table summarizing stage-wise objectives and data sources would improve clarity.
  2. [§3] Notation for the task identifier tokens (e.g., how they are embedded and concatenated with visual and textual tokens) should be formalized with an equation or diagram to avoid ambiguity.
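
One plausible formalization of the concatenation the second minor comment asks for, in our own notation rather than the paper's: with visual tokens v_1, ..., v_n, an embedded identifier e(t_k) for task k, and instruction tokens w_1, ..., w_m, the LLM input and autoregressive loss could be written as

```latex
% Illustrative notation only; the paper supplies no such equation.
\mathbf{x} = \left[\, v_1, \dots, v_n;\; e(t_k);\; e(w_1), \dots, e(w_m) \,\right],
\qquad
\mathcal{L} = -\sum_{j} \log p_\theta\!\left(y_j \mid y_{<j},\, \mathbf{x}\right)
```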

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to strengthen the presentation of the task-identifier mechanism and the experimental results.

read point-by-point responses
  1. Referee: [Abstract and §3 (method)] The central claim that unique task identifiers 'enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task' is presented without any ablation that isolates their contribution. No comparison to an otherwise identical model trained without identifiers is reported, leaving open the possibility that observed gains arise from data scale, the base LLM, or other unablated factors rather than the identifier mechanism.

    Authors: We agree that an explicit ablation isolating the identifiers would strengthen the claim. In the revised manuscript we have added a controlled ablation (same data, same base LLM, same training stages) that directly compares performance with and without the task-identifier tokens. The results show measurable gains in both task distinction (lower confusion across instructions) and final benchmark scores, confirming the identifiers' contribution beyond data scale or model choice. revision: yes

  2. Referee: [§4 (experiments)] The abstract asserts 'strong performance on many visual question-answering and visual grounding benchmarks' after three-stage training, yet the provided text supplies no quantitative numbers, baseline tables, or statistical significance tests. Without these details it is impossible to judge the magnitude or reliability of the claimed improvements over prior generalist models.

    Authors: We apologize for the omission in the submitted version. Section 4 and the associated tables have been expanded to include all quantitative results (exact accuracy, CIDEr, mIoU, etc.), direct comparisons against the cited generalist baselines, and statistical significance tests where sample sizes permit. Key tables are now placed in the main text rather than the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training results with no self-referential derivation

full rationale

The paper describes a three-stage training procedure for MiniGPT-v2 that incorporates unique task identifiers as a modeling choice. Performance claims on VQA and grounding benchmarks are presented strictly as experimental outcomes after training, not as quantities derived from equations or parameters that reduce to the inputs by construction. No self-definitional steps, fitted inputs relabeled as predictions, or load-bearing self-citations appear in the abstract or described chain. The identifier mechanism is an architectural decision whose benefits are asserted but not mathematically forced from prior results within the paper itself. This is a standard empirical ML contribution whose central results remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of task identifiers plus standard vision-language pre-training; no new physical entities or formal axioms are introduced.

free parameters (1)
  • task identifier tokens
    Short unique strings assigned to each task; chosen by the authors and treated as fixed inputs during training.
axioms (1)
  • domain assumption: Large language models can be extended to process images via an image encoder and instruction tuning.
    Invoked when the abstract states the model is built on top of an LLM for vision-language tasks.

pith-pipeline@v0.9.0 · 5504 in / 1191 out tokens · 35011 ms · 2026-05-16T07:08:36.173359+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

    cs.CV 2026-05 unverdicted novelty 8.0

    VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.

  2. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  3. Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

    cs.CY 2026-05 unverdicted novelty 7.0

    A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

  4. STORM: End-to-End Referring Multi-Object Tracking in Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.

  5. Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

  6. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  7. SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

  8. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

    cs.CR 2026-05 unverdicted novelty 6.0

    DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

  9. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  10. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  11. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  12. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  13. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    cs.CL 2023-11 unverdicted novelty 6.0

    AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.

  14. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  15. StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' accuracy on object and state localization for robotic affordance reasoning, with gains of 1.6% on RefCOCO variants and 5.2% on the new O...

  16. Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    cs.CV 2026-04 unverdicted novelty 5.0

    MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

  17. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

    cs.CV 2026-04 unverdicted novelty 5.0

    RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

  18. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  19. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 19 Pith papers · 22 internal anchors

  1. [1]

    ShareGPT

    Sharegpt. https://github.com/domeccleston/sharegpt, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    Visualgpt: Data-efficient adaptation of pretrained language models for image captioning

    Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022

  6. [6]

    Video chatcaptioner: Towards the enriched spatiotemporal descriptions

    Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny. Video chatcaptioner: Towards the enriched spatiotemporal descriptions. arXiv preprint arXiv:2304.04227, 2023

  7. [7]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023

  8. [8]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  9. [9]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  10. [10]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  11. [11]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  12. [12]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022

  13. [13]

    Multimodal-gpt: A vision and language model for dialogue with humans

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023

  14. [14]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  15. [15]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018

  16. [16]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  17. [17]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  19. [19]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  20. [20]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  21. [21]

    The hateful memes challenge: Detecting hate speech in multimodal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624, 2020

  22. [22]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  23. [23]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

  24. [24]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  25. [25]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  26. [26]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  27. [27]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023

  28. [28]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

  29. [29]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  30. [30]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

  31. [31]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019

  32. [32]

    Introducing chatgpt

    OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022

  33. [33]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011

  34. [34]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  35. [35]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023

  36. [36]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

  37. [37]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015

  38. [38]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  39. [39]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

  40. [40]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018

  41. [41]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022

  42. [42]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  43. [43]

    A-OKVQA: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022

  44. [44]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018

  45. [45]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. 2020

  46. [46]

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022

  47. [47]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  48. [48]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 2023-05-05

  50. [50]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  51. [51]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  52. [52]

    Multimodal few-shot learning with frozen language models

    Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021

  53. [53]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022

  54. [54]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023

  55. [55]

    Universal instance perception as object discovery and retrieval

    Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. Universal instance perception as object discovery and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15325–15336, 2023

  56. [56]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  57. [57]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016

  58. [58]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  59. [59]

    Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions

    Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, and Mohamed Elhoseiny. Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594, 2023

  60. [60]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  61. [61]

    Mindstorms in natural language-based societies of mind

    Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023