pith. machine review for the scientific record.

arxiv: 2310.09478 · v3 · submitted 2023-10-14 · 💻 cs.CV

Recognition: 2 theorem links


MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:08 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: MiniGPT-v2 · vision-language models · multi-task learning · visual question answering · visual grounding · large language models · unified interface

The pith

MiniGPT-v2 uses unique task identifiers to let one large language model handle many vision-language tasks at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a single model that can describe images, answer questions about them, locate objects, and handle related tasks from simple instructions. It does so by inserting special identifiers that label each task during training, which helps the model keep instructions for different tasks distinct and learn the tasks side by side. The authors train the model in three stages and then test it on standard visual question-answering and visual grounding benchmarks. If the identifiers work as intended, the same model parameters can serve as a general interface instead of requiring a separate model for each vision-language job.

Core claim

MiniGPT-v2 treats a large language model as a unified interface for vision-language tasks by assigning a unique identifier to each task type during training. These identifiers allow the model to distinguish task instructions effortlessly and improve learning efficiency across tasks. After three-stage training, the model records strong results on multiple visual question-answering and visual grounding benchmarks relative to other vision-language generalist models.

What carries the argument

Unique task identifiers assigned during training that mark each vision-language task so the model can separate instructions and avoid interference.
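
To make the mechanism concrete, here is a minimal sketch of identifier-tagged templating in Python. The template shape and the bracketed tag strings are illustrative assumptions drawn from the abstract's description, not text quoted from the paper.

```python
# Minimal sketch of task-identifier templating (assumed scheme, not the
# paper's exact format): each training example carries an explicit tag
# naming the vision-language task its instruction belongs to.

TEMPLATE = "[INST] <Img>{image}</Img> {task_id} {instruction} [/INST]"

def make_prompt(task_id: str, instruction: str, image: str = "<ImageHere>") -> str:
    """Render one multi-modal example; the tag is the only task signal."""
    return TEMPLATE.format(image=image, task_id=task_id, instruction=instruction)

print(make_prompt("[vqa]", "What is the man holding?"))
print(make_prompt("[grounding]", "Locate the red car in the image."))
```

Because otherwise-similar instructions differ only in this tag, a single set of weights can route them to task-appropriate behavior without separate heads.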

If this is right

  • A single set of model weights can perform image description, visual question answering, and visual grounding.
  • Task interference is reduced because each instruction carries an explicit label.
  • The same training recipe can be applied to additional vision-language tasks without building new models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identifier approach could be tested on video or audio tasks to see if one model can handle still more modalities.
  • Removing the identifiers after training might reveal how much the tags contribute to final performance.
  • This style of tagging could simplify scaling to new tasks by adding only a new identifier rather than retraining from scratch.

Load-bearing premise

Giving each task a unique identifier lets the model tell the tasks apart and learn them efficiently without one task hurting another.

What would settle it

Train an otherwise identical model without the task identifiers and check whether accuracy falls on individual benchmarks such as visual question answering or grounding.
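
A hedged sketch of that experiment, where `train` and `evaluate` are placeholders for a real fine-tuning and benchmark pipeline and the corpus is assumed to hold (task_tag, instruction, target) triples; none of these names come from the paper.

```python
# Hypothetical ablation harness: two runs identical except for the tags.

def run_ablation(train, evaluate, corpus, benchmarks):
    """Train twin models that differ only in whether the task tag is kept."""
    results = {}
    for use_ids in (True, False):
        data = [
            (f"{tag} {instruction}" if use_ids else instruction, target)
            for tag, instruction, target in corpus
        ]
        model = train(data)  # same base LLM, data mixture, and three stages
        key = "with_ids" if use_ids else "without_ids"
        results[key] = {b: evaluate(model, b) for b in benchmarks}
    return results  # e.g. benchmarks = ["VQAv2", "OK-VQA", "RefCOCO"]
```

A consistent per-benchmark drop in the without_ids arm would attribute the gains to the identifiers rather than to data scale or the base model.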

read the original abstract

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniGPT-v2, a vision-language model positioned as a unified interface for multiple tasks including image description, visual question answering (VQA), and visual grounding. It proposes the use of unique task identifier tokens during a three-stage training procedure to help the model distinguish instructions and improve per-task learning efficiency, claiming that the resulting model achieves strong benchmark performance relative to other vision-language generalist models.

Significance. If the performance claims are substantiated, the work would provide a lightweight mechanism for multi-task VL learning that avoids complex task-specific heads or routing, potentially simplifying the construction of generalist models. The public release of model weights and code would further support reproducibility and follow-up research.

major comments (2)
  1. [Abstract and §3 (method)] The central claim that unique task identifiers 'enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task' is presented without any ablation that isolates their contribution. No comparison to an otherwise identical model trained without identifiers is reported, leaving open the possibility that observed gains arise from data scale, the base LLM, or other unablated factors rather than the identifier mechanism.
  2. [§4 (experiments)] The abstract asserts 'strong performance on many visual question-answering and visual grounding benchmarks' after three-stage training, yet the provided text supplies no quantitative numbers, baseline tables, or statistical significance tests. Without these details it is impossible to judge the magnitude or reliability of the claimed improvements over prior generalist models.
minor comments (2)
  1. [§3] The three training stages are referenced but their exact data mixtures, learning-rate schedules, and loss weighting are not fully specified in the abstract; a concise table summarizing stage-wise objectives and data sources would improve clarity.
  2. [§3] Notation for the task identifier tokens (e.g., how they are embedded and concatenated with visual and textual tokens) should be formalized with an equation or diagram to avoid ambiguity.
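
One plausible formalization of the concatenation the second minor comment asks for, in our own notation rather than the paper's: with visual tokens v_1, ..., v_n, an embedded identifier e(t_k) for task k, and instruction tokens w_1, ..., w_m, the LLM input and autoregressive loss could be written as

```latex
% Illustrative notation only; the paper supplies no such equation.
\mathbf{x} = \left[\, v_1, \dots, v_n;\; e(t_k);\; e(w_1), \dots, e(w_m) \,\right],
\qquad
\mathcal{L} = -\sum_{j} \log p_\theta\!\left(y_j \mid y_{<j},\, \mathbf{x}\right)
```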

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to strengthen the presentation of the task-identifier mechanism and the experimental results.

read point-by-point responses
  1. Referee: [Abstract and §3 (method)] The central claim that unique task identifiers 'enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task' is presented without any ablation that isolates their contribution. No comparison to an otherwise identical model trained without identifiers is reported, leaving open the possibility that observed gains arise from data scale, the base LLM, or other unablated factors rather than the identifier mechanism.

    Authors: We agree that an explicit ablation isolating the identifiers would strengthen the claim. In the revised manuscript we have added a controlled ablation (same data, same base LLM, same training stages) that directly compares performance with and without the task-identifier tokens. The results show measurable gains in both task distinction (lower confusion across instructions) and final benchmark scores, confirming the identifiers' contribution beyond data scale or model choice. revision: yes

  2. Referee: [§4 (experiments)] The abstract asserts 'strong performance on many visual question-answering and visual grounding benchmarks' after three-stage training, yet the provided text supplies no quantitative numbers, baseline tables, or statistical significance tests. Without these details it is impossible to judge the magnitude or reliability of the claimed improvements over prior generalist models.

    Authors: We apologize for the omission in the submitted version. Section 4 and the associated tables have been expanded to include all quantitative results (exact accuracy, CIDEr, mIoU, etc.), direct comparisons against the cited generalist baselines, and statistical significance tests where sample sizes permit. Key tables are now placed in the main text rather than the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training results with no self-referential derivation

full rationale

The paper describes a three-stage training procedure for MiniGPT-v2 that incorporates unique task identifiers as a modeling choice. Performance claims on VQA and grounding benchmarks are presented strictly as experimental outcomes after training, not as quantities derived from equations or parameters that reduce to the inputs by construction. No self-definitional steps, fitted inputs relabeled as predictions, or load-bearing self-citations appear in the abstract or described chain. The identifier mechanism is an architectural decision whose benefits are asserted but not mathematically forced from prior results within the paper itself. This is a standard empirical ML contribution whose central results remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of task identifiers plus standard vision-language pre-training; no new physical entities or formal axioms are introduced.

free parameters (1)
  • task identifier tokens
    Short unique strings assigned to each task; chosen by the authors and treated as fixed inputs during training.
axioms (1)
  • domain assumption: Large language models can be extended to process images via an image encoder and instruction tuning.
    Invoked when the abstract states the model is built on top of an LLM for vision-language tasks.

pith-pipeline@v0.9.0 · 5504 in / 1191 out tokens · 35011 ms · 2026-05-16T07:08:36.173359+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

    cs.CV 2026-05 unverdicted novelty 8.0

    VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.

  2. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  3. Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

    cs.CY 2026-05 unverdicted novelty 7.0

    A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

  4. STORM: End-to-End Referring Multi-Object Tracking in Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.

  5. Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

  6. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  7. SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

  8. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

    cs.CR 2026-05 unverdicted novelty 6.0

    DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

  9. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  10. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  11. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  12. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  13. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    cs.CL 2023-11 unverdicted novelty 6.0

    AMBER is an LLM-free multi-dimensional benchmark for evaluating hallucinations in MLLMs across generative and discriminative tasks.

  14. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  15. StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' accuracy on object and state localization for robotic affordance reasoning, with gains of 1.6% on RefCOCO variants and 5.2% on the new O...

  16. Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    cs.CV 2026-04 unverdicted novelty 5.0

    MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

  17. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

    cs.CV 2026-04 unverdicted novelty 5.0

    RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

  18. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  19. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 19 Pith papers · 22 internal anchors

  1. [1]

    ShareGPT

    Sharegpt. https://github.com/domeccleston/sharegpt, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    Visualgpt: Data-efficient adaptation of pretrained language models for image captioning

    Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022

  6. [6]

    Video chatcaptioner: Towards the enriched spatiotemporal descriptions

    Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny. Video chatcaptioner: Towards the enriched spatiotemporal descriptions. arXiv preprint arXiv:2304.04227, 2023

  7. [7]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023

  8. [8]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  9. [9]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  10. [10]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  11. [11]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  12. [12]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022

  13. [13]

    Multimodal-gpt: A vision and language model for dialogue with humans

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023

  14. [14]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  15. [15]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018

  16. [16]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  17. [17]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  19. [19]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  20. [20]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  21. [21]

    The hateful memes challenge: Detecting hate speech in multimodal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624, 2020

  22. [22]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  23. [23]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

  24. [24]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  25. [25]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  26. [26]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  27. [27]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023

  28. [28]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

  29. [29]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  30. [30]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

  31. [31]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019

  32. [32]

    Introducing chatgpt

    OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022

  33. [33]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011

  34. [34]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  35. [35]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023

  36. [36]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

  37. [37]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015

  38. [38]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  39. [39]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021

  40. [40]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018

  41. [41]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022

  42. [42]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  43. [43]

    A-OKVQA: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022

  44. [44]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018

  45. [45]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. 2020

  46. [46]

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022

  47. [47]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  48. [48]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 2023-05-05

  50. [50]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  51. [51]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  52. [52]

    Multimodal few-shot learning with frozen language models

    Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021

  53. [53]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022

  54. [54]

    Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023

  55. [55]

    Universal instance perception as object discovery and retrieval

    Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. Universal instance perception as object discovery and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15325–15336, 2023

  56. [56]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  57. [57]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016

  58. [58]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  59. [59]

    Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions

    Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, and Mohamed Elhoseiny. Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594, 2023

  60. [60]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  61. [61]

    Mindstorms in natural language-based societies of mind

    Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023