MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Pith reviewed 2026-05-10 20:32 UTC · model grok-4.3
The pith
Aligning a frozen visual encoder to Vicuna via one projection layer and two-stage training produces GPT-4-like multimodal abilities such as sketch-to-website generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By aligning a frozen visual encoder with the frozen Vicuna LLM using one projection layer, trained first on short image captions and then on detailed image descriptions, MiniGPT-4 acquires numerous advanced multi-modal abilities including generating detailed image descriptions, creating websites from hand-drawn drafts, writing stories and poems inspired by images, teaching cooking from food photos, and other emerging capabilities similar to those in GPT-4.
What carries the argument
The single projection layer that maps outputs from the frozen visual encoder into the input space of the frozen Vicuna language model, enabling the LLM to interpret and respond to visual information after two-stage training.
If this is right
- The model generates detailed and natural image descriptions without repetition or fragmentation.
- Websites can be created directly from hand-drawn drafts or sketches.
- Stories and poems can be written based on input images.
- Cooking instructions can be provided from photos of food.
- Additional emergent abilities such as identifying humorous elements in images appear.
Where Pith is reading between the lines
- This indicates that freezing both the vision encoder and the LLM while training only the connector suffices for advanced multimodal performance.
- Similar alignment could be tested with other base LLMs to determine if the choice of Vicuna is necessary for these specific capabilities.
- The emphasis on a second-stage dataset of detailed descriptions implies that data curation may be as important as the alignment architecture itself for usable outputs.
Load-bearing premise
The observed advanced abilities result primarily from alignment with the advanced LLM rather than from the choice of training data or the projection layer memorizing patterns in the detailed description dataset.
What would settle it
Train the identical projection layer on the same data but align it to a weaker language model instead of Vicuna, then test whether capabilities such as website creation from hand-drawn drafts disappear.
read the original abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can possess numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, teaching users how to cook based on food photos, and so on. In our experiment, we found that the model trained on short image caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset in the second stage to finetune the model, which consequently improves the model's generation reliability and overall usability. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MiniGPT-4, which aligns a frozen visual encoder with the frozen Vicuna LLM using a single projection layer. It employs two-stage training—first on short image-caption pairs, then fine-tuning on a curated dataset of detailed image descriptions—to achieve advanced vision-language capabilities similar to GPT-4, including detailed image descriptions, website generation from hand-drawn sketches, story/poem writing from images, and instructional responses from visual inputs. The authors support these claims with qualitative examples and release the code, pre-trained weights, and dataset.
Significance. If the central claim holds, the work is significant because it shows that sophisticated multi-modal generation abilities can be obtained by aligning visual features with an advanced frozen LLM without retraining the language model itself. The public release of code, weights, and the detailed-description dataset is a clear strength that enables reproducibility and community follow-up. The primarily qualitative evaluation and lack of isolating experiments, however, limit the strength of the attribution to the LLM choice.
major comments (2)
- [Experiments] Experiments section: The assertion that the second-stage fine-tuning on the curated detailed-description dataset resolves unnatural outputs (repetitions and fragmentation) is supported solely by anecdotal before-and-after examples; no quantitative metrics (e.g., human preference scores, perplexity on held-out captions, or automated coherence measures) are reported to document the magnitude or reliability of the improvement.
- [Method and Experiments] Method and Experiments sections: The central claim that advanced capabilities arise from alignment with an advanced LLM (Vicuna) is not isolated from confounding factors; the manuscript contains no ablations comparing the same pipeline with a weaker LLM (e.g., base LLaMA), with the second-stage dataset removed, or with a higher-capacity projection layer, leaving open the possibility that observed fluency and task performance derive primarily from the high-quality second-stage data or projection-layer memorization.
minor comments (2)
- [Abstract] Abstract: The phrase 'for the first time' overstates novelty given prior alignment work (e.g., BLIP-2); rephrase to highlight the specific combination of Vicuna and the two-stage detailed-description stage.
- [Qualitative results] Qualitative results: The presented examples would be strengthened by inclusion of failure cases or a broader range of out-of-distribution images to give readers a balanced view of model limitations.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of the work's significance and reproducibility. We address the major comments point-by-point below and outline proposed revisions.
read point-by-point responses
-
Referee: Experiments section: The assertion that the second-stage fine-tuning on the curated detailed-description dataset resolves unnatural outputs (repetitions and fragmentation) is supported solely by anecdotal before-and-after examples; no quantitative metrics (e.g., human preference scores, perplexity on held-out captions, or automated coherence measures) are reported to document the magnitude or reliability of the improvement.
Authors: We thank the referee for this observation. Our current evidence for the benefits of the second-stage fine-tuning is qualitative. We agree that quantitative support would be valuable. In the revised manuscript, we will include results from a human study where evaluators compare first-stage and second-stage outputs on naturalness and coherence for a held-out set of images, reporting win rates or preference percentages. revision: yes
-
Referee: Method and Experiments sections: The central claim that advanced capabilities arise from alignment with an advanced LLM (Vicuna) is not isolated from confounding factors; the manuscript contains no ablations comparing the same pipeline with a weaker LLM (e.g., base LLaMA), with the second-stage dataset removed, or with a higher-capacity projection layer, leaving open the possibility that observed fluency and task performance derive primarily from the high-quality second-stage data or projection-layer memorization.
Authors: We concur that isolating the contribution of the advanced LLM through ablations would strengthen the attribution. However, we did not perform training with base LLaMA due to the substantial computational cost and time required. The manuscript already notes that the first-stage model (without second-stage fine-tuning) produces unnatural outputs, indicating the second stage's role in improving language quality. We used a minimal projection layer to show that advanced capabilities do not require complex alignment modules. In revision, we will add a dedicated limitations paragraph discussing these points and the potential role of the second-stage data, while maintaining that the simple alignment to Vicuna enables the observed GPT-4-like behaviors as evidenced by the qualitative demonstrations. revision: partial
- We cannot conduct the full set of ablation experiments with base LLaMA or additional quantitative isolation studies within the scope of this work due to resource limitations.
Circularity Check
No circularity; empirical alignment demonstration with no derivation chain or self-referential predictions.
full rationale
The paper makes no mathematical or first-principles claims. It describes an empirical two-stage training procedure (short captions then curated detailed descriptions) to align a frozen visual encoder to a frozen Vicuna LLM via a single projection layer, then reports qualitative capabilities. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing steps reduce to self-citations or ansatzes imported from prior author work. The central observation—that alignment yields GPT-4-like behaviors—is presented as an empirical finding supported by released model outputs and a public dataset, not as a derivation that collapses to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- projection layer weights
axioms (1)
- domain assumption Frozen visual encoder and frozen Vicuna LLM preserve their pre-trained capabilities when only the projection is trained.
Lean theorems connected to this paper
-
Foundation.DimensionForcingdimension_forced unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
-
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
-
OZ-TAL: Online Zero-Shot Temporal Action Localization
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
-
CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
-
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
-
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
PolarVLM integrates polarimetric physical parameters into VLMs via dual-stream architecture and progressive training, outperforming RGB baselines by 25.4% on a new 75K-pair polarization-aware VQA benchmark.
-
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks w...
-
Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.
-
ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
-
VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot...
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
-
Topo-R1: Detecting Topological Anomalies via Vision-Language Models
Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.
-
SpatialMosaic: A Multiview VLM Dataset for Partial Visibility
SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.
-
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
-
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
DeFacto trains multimodal models using counterfactual image variants and reinforcement learning rewards to improve both answer accuracy and evidence-answer consistency.
-
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM perf...
-
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
-
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
-
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion p...
-
FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks
FDM-Bench is a new benchmark dataset for evaluating LLMs on FDM tasks including user queries and G-code anomaly detection, with expert-assessed results showing closed-source models outperforming on anomaly detection a...
-
Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achie...
-
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
-
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
-
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
-
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Multimodal LLMs under Pairwise Modalities
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
-
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
-
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation
FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.
-
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
Vision Inference Former adds a direct visual-to-output bridge that continuously injects visual semantics during MLLM decoding to sustain consistency and reduce modality imbalance.
Reference graph
Works this paper leans on
-
[1]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901,
work page 1901
-
[2]
Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny. Video chatcaptioner: Towards the enriched spatiotemporal descriptions. arXiv preprint arXiv:2304.04227,
-
[3]
PaLM: Scaling Language Modeling with Pathways
URL https: //vicuna.lmsys.org. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
work page internal anchor Pith review arXiv
-
[4]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416,
work page internal anchor Pith review arXiv
-
[5]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378,
work page internal anchor Pith review arXiv
-
[7]
EV A: Exploring the Limits of Masked Visual Repre- sentation Learning at Scale 2022
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636,
-
[8]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045,
work page internal anchor Pith review arXiv
-
[11]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pp. 787–798,
work page 2014
-
[12]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597,
work page internal anchor Pith review arXiv
-
[13]
Connecting vision and language with localized narratives
Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16 , pp. 647–664. Springer,
work page 2020
-
[14]
Object Hallucination in Image Captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156,
-
[15]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili´c, Daniel Hesslow, Roman Castagn´e, Alexandra Sasha Luccioni, Fran c ¸ois Yvon, Matthias Gall ´e, et al. Bloom: A 176b- parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022a. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili´c,...
work page internal anchor Pith review arXiv
-
[16]
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990,
-
[17]
ViperGPT: Visual Inference via Python Execution for Reasoning
D´ıdac Sur´ıs, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128,
work page internal anchor Pith review arXiv
-
[18]
Plug-and- play vqa: Zero-shot vqa by conjoining large pre- trained models with zero training
Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven CH Hoi. Plug-and- play vqa: Zero-shot vqa by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773,
-
[19]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
ISSN 2835-8856. URL https://openreview.net/forum?id=yzkSU5zdwD. Survey Certification. Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Vi- sual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671,
work page internal anchor Pith review arXiv
-
[21]
Zero-shot video question answering via frozen bidirectional language models
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. arXiv preprint arXiv:2206.08155,
-
[22]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,
work page internal anchor Pith review arXiv
-
[23]
Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions
Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, and Mohamed Elho- seiny. Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594,
-
[24]
12 Preprint A A PPENDIX A.1 M ORE QUALITATIVE RESULTS Explain why this meme is funny. Thismemeisfunnybecauseitshowsadoglayingonthefloorwithitsheadrestingonitspaws,asifitistiredorsleepy.Thecaption,"mondayjustmonday,"addstothehumorbysuggestingthatthedogisfeelingthesamewayasmanypeopledoonMondays,whichareoftenconsideredtobethemostdreadeddayoftheweek.Thememeis...
work page 1969
-
[25]
and GQA (Hudson & Manning, 2019). Additionally, to showcase the potential of MiniGPT-4 with traditional benchmarks, we conduct a straightforward ablation study. Here, we simply unfreeze the LLM using LoRA (Hu et al.,
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.