LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
hub Canonical reference
arXiv preprint arXiv:2108.10904 , year=
Canonical reference. 71% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
GenLIP pretrains ViTs to generate language tokens from images via LM objective without contrastive batches or extra decoders, matching baselines on less data and improving on OCR after multi-resolution continued pretraining.
RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.
MApLe disentangles anatomy and pathology to align free-text diagnostic sentences with specific patches in large medical images via multi-instance learning.
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.
WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.
HTSC-CIF applies hierarchical task decomposition and cross-modal causal intervention to generate medical reports from images while addressing domain knowledge, alignment, and bias challenges.
Introduces GRIT, LTMI, and a hierarchical attention framework claiming performance gains on image captioning, visual dialog, and ALFRED instruction following.
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
citing papers explorer
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
CoCa: Contrastive Captioners are Image-Text Foundation Models
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.