pith. machine review for the scientific record.

arxiv: 2605.07544 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 Lean theorem links

From Pixels to Prompts: Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords vision-language models · mental map · multimodal AI · overview · prompt engineering · image understanding · language reasoning

The pith

A mental map of vision-language models supplies structure to read new papers confidently and design systems without blind assembly of parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The manuscript addresses the difficulty of keeping up with fast-moving vision-language models, where new variants appear constantly and the distance between buzzwords and actual understanding feels wide. Rather than catalog every dataset, benchmark, and model, it supplies a structured overview that organizes core ideas from image processing through language interaction. This map is meant to let readers interpret fresh research without prior exhaustive knowledge and to let builders create applications by grasping principles instead of copying components. The author presents this as a modest but durable alternative to exhaustive surveys in a field that changes rapidly.

Core claim

The book provides a clear mental map of vision-language models that gives enough structure to read new papers with confidence and enough intuition to design systems without assembling components blindly, instead of offering an exhaustive catalog of datasets, benchmarks, and variants.

What carries the argument

The mental map itself: a non-exhaustive, intuition-focused overview that organizes how vision-language models connect visual input to language output, reasoning, and instruction following.

If this is right

  • New papers become readable by fitting their contributions into the existing map rather than starting from zero.
  • System design shifts from copying existing architectures to combining understood components with clear purpose.
  • The gap between surface familiarity and operational knowledge narrows for people entering the field.
  • Ongoing model releases can be integrated into the same framework instead of requiring separate learning each time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same map approach could be applied to other fast-moving multimodal areas to reduce knowledge fragmentation.
  • Interactive versions of the map might let users explore component relationships directly rather than through text alone.
  • In fields with high publication volume, targeted conceptual overviews may prove more useful for practitioners than comprehensive surveys.
  • The structure could highlight common failure modes across models, making debugging more systematic.

Load-bearing premise

A non-exhaustive overview centered on intuition can still deliver lasting understanding without needing complete coverage of every dataset, benchmark, and model variant.

What would settle it

The claim fails if readers who study the map can neither interpret a new vision-language paper more readily nor design a working system with less trial-and-error than readers who rely only on scattered papers and code repositories.

Figures

Figures reproduced from arXiv: 2605.07544 by Khang Hoang Nhat Vo.

Figure 2.1: Schematic of a classical convolutional neural network used as a visual en…
Figure 2.2: Illustration of a residual block in a deep residual network (ResNet).
Figure 2.3: Architecture of the Vision Transformer (ViT).
Figure 2.4: Illustrations of Feature Pyramid Networks (FPNs) for multi-scale object…
Figure 2.5: Illustration of the Momentum Contrast framework for self-supervised rep…
Figure 2.6: Architecture of Bootstrap Your Own Latent (BYOL) for self-supervised vi…
Figure 3.1: Recurrent neural network for language modeling, unrolled over time. At…
Figure 3.2: Transformer encoder–decoder architecture for sequence modeling.
Figure 4.1: Architecture of the Show and Tell image captioning model of Vinyals et al.
Figure 4.2: Schematic overview of the BLIP architecture.
Figure 4.3: Illustration of the BLIP-2 architecture with a Querying Transformer (Q…
Figure 4.4: Bootstrapping BLIP-2 from different classes of large language models.
Figure 4.5: High-level Flamingo architecture [3]. One or more images are first encoded by a frozen vision backbone, producing dense feature maps. A Perceiver Resampler, a small transformer trained from scratch, consumes these features and outputs a fixed-size set of latent visual tokens for each image, decoupling the number of visual tokens from the input resolution. These visual tokens are injected into selected l…
Figure 4.6: Internal structure of Flamingo's gated cross-attention (GATED XATTN…
Figure 4.7: Schematic of the LLaVA architecture [53]. An input image Xv is encoded by a frozen CLIP Vision Transformer, producing visual features Zv. A learned projection matrix W (implemented as a small MLP) maps Zv into the language model's hidden space, yielding a sequence of visual embeddings Hv. These visual tokens are concatenated with the hidden representations Hq of a text instruction Xq and passed into a f…
Figure 4.8: Three-stage training pipeline for Qwen-VL.
Figure 5.1: Illustration of the Conceptual Captions (CC3M) text normalization.
Figure 5.2: Examples from LAION-5B [71]. Each column shows a text query (Q) and the image with its caption (C) returned as the nearest neighbor in CLIP embedding space. The queries behave like natural user prompts (e.g., "An armchair that looks like an apple", "pink photo of Tokyo"), while the associated captions summarize the retrieved images.
Figure 5.3: Example from Visual Genome [40]. The image is annotated with object bounding boxes (left), region-level captions describing localized parts of the scene (middle), and a scene graph (bottom) whose nodes denote objects and attributes and whose edges encode labeled relations (e.g., man sits on bench, bench in front of river).
Figure 5.4: Examples from the Open Images dataset [42]. Left: image-level labels for multi-label classification. Middle: object detection annotations with category-specific bounding boxes. Right: visual relationship detection annotations, where pairs of boxes (e.g., man, guitar) are linked by relation labels (e.g., holds).
Figure 5.5: Examples of clip-caption pairs from HowTo100M.
Figure 5.6: Example video-caption pairs from the WebVid dataset (WebVid2M).
Figure 5.7: Example entry from the WIT dataset [76], based on the Wikipedia page for Half Dome. The figure highlights the different textual fields extracted for each image: page title, lead paragraph, section titles, image caption, and reference description. WIT stores these fields (often in multiple languages) alongside the associated image, providing rich, structured supervision for multilingual multimodal pretra…
Figure 5.8: Illustrative examples from TextCaps [73]. Each image is paired with multiple captions that explicitly refer to the text visible in the scene (e.g., digits on displays, brand names, slogans). Some caption tokens directly copy text from the image, while others paraphrase it or add inferred information, making the dataset a strong benchmark for text-aware visual captioning.
Figure 5.9: Example image-caption pairs from the MS COCO Captions benchmark.
Figure 5.10: Examples from Flickr30k Entities [64]. Each image is annotated with bounding boxes around salient entities (e.g., man, glasses, wedding cake), and corresponding noun phrases in the captions are color-coded to match these regions. This phrase-to-region supervision supports training and evaluation of models that must ground textual mentions to specific objects in the image.
Figure 5.11: Example question-image-answer triplets from the balanced VQA v2…
Figure 5.12: Example from the GQA dataset [35]. Objects, attributes, and relations (e.g., bowl, apple, green, behind, on top of) are annotated in the image, and natural-language questions are derived from the underlying scene graph, such as "Is the bowl to the right of the green apple?" or "What type of fruit in the image is round?". This design explicitly links questions to structured semantics and supports detail…
Figure 5.13: Example question-image-answer triplets from OK-VQA.
Figure 5.14: Example images and questions from the VizWiz dataset.
Figure 5.15: Example from the RefCOCO family of referring expression datasets.
Figure 5.16: Representative examples from the TextVQA dataset.
Figure 5.17: Example from DocVQA [59]. The model must answer multiple questions about a scanned envelope, such as the ZIP code, the postmark date, and the company name. Successful solutions require accurate OCR, spatial layout understanding, and reasoning over several textual elements on the page.
Figure 5.18: Example from ChartQA [58]. The model is asked questions about a line chart summarizing survey responses, such as "Which year has the most divergent opinions about Brazil's economy?" and "What is the peak value of the orange line?". Solving such problems requires accurate reading of plotted values, comparison between series, and reasoning over trends rather than just recognizing objects.
Figure 5.19: Overview of the POPE (Polling-based Object Probing Evaluation) pipeline for measuring object hallucination in LVLMs [46]. Given an input image, ground-truth object categories (e.g., person, chair, umbrella) are obtained from human or automatic annotations, while nonexistent objects (e.g., dog, table, surfboard) are sampled from random, frequent, or adversarial distributions. For each object, the model …
Figure 5.20: Ability taxonomy in MMBench [54]. The benchmark organizes visual questions into hierarchical ability dimensions, separating broad Perception and Reasoning categories and further decomposing them into fine-grained skills (outer ring), such as OCR, spatial relationships, logical reasoning, and identity or attribute reasoning.
Figure 5.21: Overview of the MMMU benchmark [86]. The dataset comprises ∼11.5K college-level, multiple-choice problems drawn from six broad disciplines and 30 subjects, featuring heterogeneous image types (e.g., diagrams, charts, photographs), interleaved text and images, and questions designed to probe expert-level perception, knowledge, and reasoning.
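Two of the architecture captions above are concrete enough to sketch in code. The Flamingo captions (Figures 4.5 and 4.6) describe latent visual tokens injected into a frozen language model through tanh-gated cross-attention layers. Below is a minimal sketch of that gating pattern, not the authors' implementation; the dimensions, names, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Tanh-gated cross-attention in the style the Figure 4.5/4.6 captions
    describe: frozen language-model states attend to latent visual tokens,
    and zero-initialized gates leave the pretrained LM unchanged at start."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # tanh(0) = 0, so each residual branch starts as a no-op.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden:   (batch, n_text, d_model)    frozen LM hidden states
        # visual_tokens: (batch, n_latents, d_model) resampler output
        attn_out, _ = self.xattn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ffw_gate) * self.ffw(x)
```

Because both gates start at zero, the combined model initially reproduces the frozen language model exactly and learns how much visual information to admit.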
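The LLaVA caption (Figure 4.7) likewise spells out a full data path: frozen CLIP features Zv, a small projection MLP W, and visual embeddings Hv concatenated with the embedded instruction Hq. A hedged sketch of that path follows; the two-layer MLP and all sizes are assumptions for the example, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen vision-encoder features Z_v into the language model's
    hidden space, yielding visual embeddings H_v (the caption's W)."""

    def __init__(self, d_vision: int, d_lm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        return self.proj(z_v)  # (batch, n_patches, d_lm)

# Usage sketch with made-up sizes: prepend projected visual tokens to the
# instruction embeddings and feed the result to the language model.
batch, n_patches, d_vision, d_lm = 2, 256, 1024, 4096
z_v = torch.randn(batch, n_patches, d_vision)   # frozen CLIP ViT features Z_v
h_q = torch.randn(batch, 32, d_lm)              # embedded instruction X_q -> H_q
h_v = VisualProjector(d_vision, d_lm)(z_v)      # visual embeddings H_v
lm_input = torch.cat([h_v, h_q], dim=1)         # sequence passed to the LM
```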
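The POPE caption (Figure 5.19) also describes an executable procedure: poll the model with yes/no existence questions about objects that are and are not in the image. A toy version follows, using only the uniform-random negative strategy (the caption also names frequent and adversarial sampling); the helper name and question template are invented for illustration.

```python
import random

def pope_probes(present, candidate_pool, n_negatives=3, seed=0):
    """Build POPE-style yes/no probes for one image: each ground-truth
    object yields a positive question, and objects absent from the image
    are sampled as negatives."""
    rng = random.Random(seed)
    absent = [obj for obj in candidate_pool if obj not in present]
    negatives = rng.sample(absent, min(n_negatives, len(absent)))
    probes = [(f"Is there a {obj} in the image?", "yes") for obj in present]
    probes += [(f"Is there a {obj} in the image?", "no") for obj in negatives]
    return probes

# Example with the caption's object categories.
print(pope_probes(["person", "chair", "umbrella"],
                  ["person", "chair", "umbrella", "dog", "table", "surfboard"]))
```

Hallucination is then scored by how often the model answers "yes" to the negative probes.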
Original abstract

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript offers an expository overview of Vision-Language Models, with the central claim that a non-exhaustive, intuition-focused mental map can supply enough structure for readers to engage new VLM papers confidently and to design systems without assembling components blindly. It explicitly disclaims exhaustive coverage of datasets, benchmarks, or model variants, framing its contribution as durable pedagogical structure rather than completeness or novel technical results.

Significance. If the delivered mental map proves clear and durable, the work could hold meaningful pedagogical value in the fast-moving cs.AI field by reducing the gap between superficial buzzword knowledge and practical understanding, thereby supporting more effective paper reading and system design in multimodal models.

minor comments (2)
  1. Abstract: the abstract contains unrendered LaTeX markup (e.g., \emph{it is too easy to get lost} and ``...''-style quotes); ensure consistent formatting and rendering in the final version.
  2. Throughout: repeated self-reference to "this book" creates ambiguity about the intended format and venue; clarify whether the manuscript is a journal article, survey, or book chapter.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript's pedagogical intent and for recommending minor revision. The work deliberately prioritizes a durable, intuition-focused mental map over exhaustive coverage of datasets or models, as stated in the abstract, to help readers navigate the fast-moving VLM literature with greater confidence.

Circularity Check

0 steps flagged

No significant circularity; purely expository overview

full rationale

The manuscript contains no derivations, equations, fitted parameters, predictions, or self-citations that could form a load-bearing chain. Its stated purpose is to supply an intuition-focused mental map for readers, explicitly disclaiming exhaustive coverage of datasets or models. No step reduces by construction to its own inputs, and the central claim remains independent of any internal fitting or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an explanatory book on existing technology, the work introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5508 in / 895 out tokens · 28910 ms · 2026-05-11T02:40:51.733916+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 3 internal anchors

  [1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
  [2] Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).
  [3] Ren, Shaoqing; He, Kaiming; Girshick, Ross; Sun, Jian. Faster…
  [4] He, Kaiming; Gkioxari, Georgia; Dollár, et al. Mask… IEEE International Conference on Computer Vision (ICCV).
  [5] Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
  [6] Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS).
  [7] Training Language Models to Follow Instructions with Human Feedback. arXiv preprint arXiv:2203.02155.
  [8] LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  [9] Scaling up Visual and Vision-Language Representation Learning With Noisy Text Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).
  [10] Lu, Jiasen; Batra, Dhruv; Parikh, Devi; Lee, Stefan.
  [11] Tan, Hao; Bansal, Mohit.
  [12] Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems (NeurIPS).
  [13] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597.
  [14] Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollár, et al. Microsoft… European Conference on Computer Vision (ECCV).
  [15] Show and Tell: A Neural Image Caption Generator. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [16] Antol, Stanislaw; Agrawal, Aishwarya; Lu, Jiasen; Mitchell, Margaret; Batra, Dhruv; Zitnick, C. Lawrence; Parikh, Devi.
  [17] Goyal, Yash; Khot, Tejas; Summers-Stay, Douglas; Batra, Dhruv; Parikh, Devi. Making the V in…
  [18] Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [19] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. IEEE International Conference on Computer Vision (ICCV).
  [20] ReferItGame: Referring to Objects in Photographs of Natural Scenes. Empirical Methods in Natural Language Processing (EMNLP).
  [21] LeCun, Yann; Bottou, Léon; et al. Gradient-Based Learning Applied to Document Recognition.
  [22] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. Advances in Neural Information Processing Systems (NeurIPS).
  [23] Simonyan, Karen; Zisserman, Andrew. International Conference on Learning Representations (ICLR).
  [24] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [25] Lin, Tsung-Yi; Dollár, et al. Feature Pyramid Networks for Object Detection.
  [26] Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; Ma, Sean; Huang, Zhiheng; Karpathy, Andrej; Khosla, Aditya; Bernstein, Michael; Berg, Alexander C.; Fei-Fei, Li. International Journal of Computer Vision (IJCV).
  [27] Chen, Ting; Kornblith, Simon; Norouzi, Mohammad; Hinton, Geoffrey. Proceedings of the 37th International Conference on Machine Learning (ICML).
  [28] He, Kaiming; Fan, Haoqi; Wu, Yuxin; Xie, Saining; Girshick, Ross. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [29] Grill, Jean-Bastien; Strub, Florian; Altché, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. 2020.
  [30] Caron, Mathilde; Touvron, Hugo; Misra, Ishan; et al. Emerging Properties in Self-Supervised Vision Transformers. 2021.
  [31] He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, et al. Masked Autoencoders Are Scalable Vision Learners. 2022.
  [32] Elman, Jeffrey L. Cognitive Science.
  [33] Hochreiter, Sepp; Schmidhuber, Jürgen. Long Short-Term Memory.
  [34] Cho, Kyunghyun; van Merriënboer, et al. Learning Phrase Representations using… Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  [35] On the Opportunities and Risks of Foundation Models.
  [36] Language Models are Unsupervised Multitask Learners.
  [37] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. 2019.
  [38] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning (ICML).
  [39] Li, Junnan; Selvaraju, Ramprasaath R.; Gotmare, Akhilesh; Joty, Shafiq; Xiong, Caiming; Hoi, Steven C. H.
  [40] Visual Instruction Tuning. Advances in Neural Information Processing Systems (NeurIPS).
  [41] Qwen Technical Report. 2023.
  [42] Language Is Not All You Need: Aligning Perception with Language Models. Advances in Neural Information Processing Systems (NeurIPS).
  [43] PaLI-X: On Scaling Up a Multilingual Vision and Language Model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [44] Zhai, Xiaohua; Wang, Xingyi; Mustafa, Basil; Kolesnikov, Alexander; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Houlsby, Neil; et al.
  [45] Chen, Yen-Chun; Li, Linjie; Yu, Licheng; Kholy, Ahmed El; Ahmed, Faisal; Gan, Zhe; Cheng, Yu; Liu, Jingjing.
  [46] Zhang, Pengchuan; Li, Xiujun; Hu, Xiaowei; Yang, Jianwei; Zhang, Lei; Wang, Lijuan; Choi, Yejin; Gao, Jianfeng.
  [47] Houlsby, Neil; Giurgiu, Andrei; Jastrzebski, Stanislaw; Morrone, Bryan; de Laroussilhe, Quentin; Gesmundo, Andrea; Attariyan, Mona; Gelly, Sylvain. Parameter-Efficient Transfer Learning for…
  [48] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu.
  [49] Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).
  [50] Changpinyo, Soravit; Sharma, Piyush; Ding, Nan; Soricut, Radu. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [51] Krishna, Ranjay; Zhu, Yuke; Groth, Oliver; Johnson, Justin; Hata, Kenji; Kravitz, Joshua; Chen, Stephanie; Kalantidis, Yannis; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. 2017.
  [52] Kuznetsova, Alina; Rom, Hassan; Alldrin, Neil; Uijlings, Jasper; Krasin, Ivan; Pont-Tuset, Jordi; et al. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale.
  [53] Miech, Antoine; Zhukov, Dimitri; et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips.
  [54] Bain, Max; Nagrani, Arsha; Varol, Gül; Zisserman, Andrew. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  [55] Xu, Jun; Mei, Tao; Yao, Ting; Rui, Yong. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [56] Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc. Proceedings of the 30th International World Wide Web Conference (WWW).
  [57] Harley, Adam W.; Ufkes, Alex; Derpanis, Konstantinos G. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR).
  [58] Zhong, Xu; Tang, Jianbin; Jimeno, Antonio; et al. PubLayNet: Largest Dataset Ever for Document Layout Analysis.
  [59] Sidorov, Oleksii; Hu, Ronghang; Rohrbach, Marcus; Singh, Amanpreet. Computer Vision –…
  [60] Schuhmann, Christoph; Beaumont, Romain; Vencu, Richard; Gordon, Cade; Wightman, Ross; Cherti, Mehdi; Coombes, Theo; Katta, Aarush; Mullis, Clayton; Wortsman, Mitchell; Schramowski, Patrick; Kundurthy, Srivatsa; Crowson, Katherine; Schmidt, Ludwig; Kaczmarczyk, Robert; Jitsev, Jenia. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track.
  [61] PaLI: A Jointly-Scaled Multilingual Language-Image Model. 2022.
  [62] PaLI-X: On Scaling Up a Multilingual Vision and Language Model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [63] LLaVA-NeXT: Improved Reasoning,… 2024.
  [64] LLaVA-NeXT-Interleave: Tackling Multi-Image, Video, and 3D in Large Multimodal Models. 2024.
  [65] Bai, Shuai; Chen, Keqin; Liu, Xuejing; Wang, Jialin; Ge, Wenbin; Song, Sibo; et al. Qwen2.5-…
  [66] Bai, Shuai; Cai, Yuxuan; Chen, Ruizhe; Chen, Keqin; Cheng, Zesen; et al. Qwen3-…
  [67] Ovis2.5 Technical Report. 2025.
  [68] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. 2025.
  [69] Multi30K: Multilingual English–German Image Descriptions. Proceedings of the 5th Workshop on Vision and Language.
  [70] Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollár, et al. Microsoft… Proceedings of the European Conference on Computer Vision (ECCV).
  [71] From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics.
  [72] nocaps: Novel Object Captioning at Scale. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  [73] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [74] Marino, Kenneth; Rastegari, Mohammad; Farhadi, Ali; Mottaghi, Roozbeh.
  [75] Gurari, Danna; Li, Qing; Stangl, Amanda J.; Guo, Abigale; Lin, Chi; Grauman, Kristen; Wilson, Caroline; Bigham, Jeffrey P.
  [76] Modeling Context in Referring Expressions. Proceedings of the European Conference on Computer Vision (ECCV).
  [77] Kazemzadeh, Sahar; Ordonez, Vicente; Matten, Mark; Berg, Tamara.
  [78] Plummer, Bryan A.; Wang, Liwei; Cervantes, Chris M.; Caicedo, Juan C.; Hockenmaier, Julia; Lazebnik, Svetlana.
  [79] Singh, Ankur; Natarajan, Vivek; Shah, Meet; Jiang, Yu; Chen, Xinlei; Batra, Dhruv; Parikh, Devi; Rohrbach, Marcus.
  [80] Mathew, Minesh; Karatzas, Dimosthenis; Jawahar, C. V.
Showing first 80 references.