pith. machine review for the scientific record.

arxiv: 2605.07544 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 Lean theorem links

From Pixels to Prompts: Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords vision-language models · mental map · multimodal AI · overview · prompt engineering · image understanding · language reasoning

The pith

A mental map of vision-language models supplies structure to read new papers confidently and design systems without blind assembly of parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The manuscript addresses the difficulty of keeping up with fast-moving vision-language models, where new variants appear constantly and the distance between buzzwords and actual understanding feels wide. Rather than catalog every dataset, benchmark, and model, it supplies a structured overview that organizes core ideas from image processing through language interaction. This map is meant to let readers interpret fresh research without prior exhaustive knowledge and to let builders create applications by grasping principles instead of copying components. The author presents this as a modest but durable alternative to exhaustive surveys in a field that changes rapidly.

Core claim

The book provides a clear mental map of vision-language models that gives enough structure to read new papers with confidence and enough intuition to design systems without assembling components blindly, instead of offering an exhaustive catalog of datasets, benchmarks, and variants.

What carries the argument

The mental map itself: a non-exhaustive, intuition-focused overview that organizes how vision-language models connect visual input to language output, reasoning, and instruction following.

If this is right

  • New papers become readable by fitting their contributions into the existing map rather than starting from zero.
  • System design shifts from copying existing architectures to combining understood components with clear purpose.
  • The gap between surface familiarity and operational knowledge narrows for people entering the field.
  • Ongoing model releases can be integrated into the same framework instead of requiring separate learning each time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same map approach could be applied to other fast-moving multimodal areas to reduce knowledge fragmentation.
  • Interactive versions of the map might let users explore component relationships directly rather than through text alone.
  • In fields with high publication volume, targeted conceptual overviews may prove more useful for practitioners than comprehensive surveys.
  • The structure could highlight common failure modes across models, making debugging more systematic.

Load-bearing premise

A non-exhaustive overview centered on intuition can still deliver lasting understanding without needing complete coverage of every dataset, benchmark, and model variant.

What would settle it

The claim fails if readers who study the map can neither interpret a new vision-language paper more readily nor design a working system with less trial-and-error than readers who rely only on scattered papers and code repositories.

Figures

Figures reproduced from arXiv: 2605.07544 by Khang Hoang Nhat Vo.

Figure 2.1: Schematic of a classical convolutional neural network used as a visual en…
Figure 2.2: Illustration of a residual block in a deep residual network (ResNet).
Figure 2.3: Architecture of the Vision Transformer (ViT).
Figure 2.4: Illustrations of Feature Pyramid Networks (FPNs) for multi-scale object…
Figure 2.5: Illustration of the Momentum Contrast framework for self-supervised rep…
Figure 2.6: Architecture of Bootstrap Your Own Latent (BYOL) for self-supervised vi…
Figure 3.1: Recurrent neural network for language modeling, unrolled over time. At…
Figure 3.2: Transformer encoder–decoder architecture for sequence modeling.
Figure 4.1: Architecture of the Show and Tell image captioning model of Vinyals et al.
Figure 4.2: Schematic overview of the BLIP architecture.
Figure 4.3: Illustration of the BLIP-2 architecture with a Querying Transformer (Q…
Figure 4.4: Bootstrapping BLIP-2 from different classes of large language models.
Figure 4.5: High-level Flamingo architecture [3]. One or more images are first encoded by a frozen vision backbone, producing dense feature maps. A Perceiver Resampler, a small transformer trained from scratch, consumes these features and outputs a fixed-size set of latent visual tokens for each image, decoupling the number of visual tokens from the input resolution. These visual tokens are injected into selected l…
Figure 4.6: Internal structure of Flamingo's gated cross-attention (GATED XATTN…
Figure 4.7: Schematic of the LLaVA architecture [53]. An input image Xv is encoded by a frozen CLIP Vision Transformer, producing visual features Zv. A learned projection matrix W (implemented as a small MLP) maps Zv into the language model's hidden space, yielding a sequence of visual embeddings Hv. These visual tokens are concatenated with the hidden representations Hq of a text instruction Xq and passed into a f…
Figure 4.8: Three-stage training pipeline for Qwen-VL.
Figure 5.1: Illustration of the Conceptual Captions (CC3M) text normalization.
Figure 5.2: Examples from LAION-5B [71]. Each column shows a text query (Q) and the image with its caption (C) returned as the nearest neighbor in CLIP embedding space. The queries behave like natural user prompts (e.g., "An armchair that looks like an apple", "pink photo of Tokyo"), while the associated captions summarize the retrieved images.
Figure 5.3: Example from Visual Genome [40]. The image is annotated with object bounding boxes (left), region-level captions describing localized parts of the scene (middle), and a scene graph (bottom) whose nodes denote objects and attributes and whose edges encode labeled relations (e.g., man sits on bench, bench in front of river).
Figure 5.4: Examples from the Open Images dataset [42]. Left: image-level labels for multi-label classification. Middle: object detection annotations with category-specific bounding boxes. Right: visual relationship detection annotations, where pairs of boxes (e.g., man, guitar) are linked by relation labels (e.g., holds).
Figure 5.5: Examples of clip-caption pairs from HowTo100M.
Figure 5.6: Example video-caption pairs from the WebVid dataset (WebVid2M).
Figure 5.7: Example entry from the WIT dataset [76], based on the Wikipedia page for Half Dome. The figure highlights the different textual fields extracted for each image: page title, lead paragraph, section titles, image caption, and reference description. WIT stores these fields (often in multiple languages) alongside the associated image, providing rich, structured supervision for multilingual multimodal pretra…
Figure 5.8: Illustrative examples from TextCaps [73]. Each image is paired with multiple captions that explicitly refer to the text visible in the scene (e.g., digits on displays, brand names, slogans). Some caption tokens directly copy text from the image, while others paraphrase it or add inferred information, making the dataset a strong benchmark for text-aware visual captioning.
Figure 5.9: Example image-caption pairs from the MS COCO Captions benchmark.
Figure 5.10: Examples from Flickr30k Entities [64]. Each image is annotated with bounding boxes around salient entities (e.g., man, glasses, wedding cake), and corresponding noun phrases in the captions are color-coded to match these regions. This phrase-to-region supervision supports training and evaluation of models that must ground textual mentions to specific objects in the image.
Figure 5.11: Example question-image-answer triplets from the balanced VQA v2…
Figure 5.12: Example from the GQA dataset [35]. Objects, attributes, and relations (e.g., bowl, apple, green, behind, on top of) are annotated in the image, and natural-language questions are derived from the underlying scene graph, such as "Is the bowl to the right of the green apple?" or "What type of fruit in the image is round?". This design explicitly links questions to structured semantics and supports detail…
Figure 5.13: Example question-image-answer triplets from OK-VQA.
Figure 5.14: Example images and questions from the VizWiz dataset.
Figure 5.15: Example from the RefCOCO family of referring expression datasets.
Figure 5.16: Representative examples from the TextVQA dataset.
Figure 5.17: Example from DocVQA [59]. The model must answer multiple questions about a scanned envelope, such as the ZIP code, the postmark date, and the company name. Successful solutions require accurate OCR, spatial layout understanding, and reasoning over several textual elements on the page.
Figure 5.18: Example from ChartQA [58]. The model is asked questions about a line chart summarizing survey responses, such as "Which year has the most divergent opinions about Brazil's economy?" and "What is the peak value of the orange line?". Solving such problems requires accurate reading of plotted values, comparison between series, and reasoning over trends rather than just recognizing objects.
Figure 5.19: Overview of the POPE (Polling-based Object Probing Evaluation) pipeline for measuring object hallucination in LVLMs [46]. Given an input image, ground-truth object categories (e.g., person, chair, umbrella) are obtained from human or automatic annotations, while nonexistent objects (e.g., dog, table, surfboard) are sampled from random, frequent, or adversarial distributions. For each object, the model …
Figure 5.20: Ability taxonomy in MMBench [54]. The benchmark organizes visual questions into hierarchical ability dimensions, separating broad Perception and Reasoning categories and further decomposing them into fine-grained skills (outer ring), such as OCR, spatial relationships, logical reasoning, and identity or attribute reasoning.
Figure 5.21: Overview of the MMMU benchmark [86]. The dataset comprises ∼11.5K college-level, multiple-choice problems drawn from six broad disciplines and 30 subjects, featuring heterogeneous image types (e.g., diagrams, charts, photographs), interleaved text and images, and questions designed to probe expert-level perception, knowledge, and reasoning.
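Two of the architecture captions above are concrete enough to sketch in code. The Flamingo captions (Figures 4.5 and 4.6) describe latent visual tokens injected into a frozen language model through tanh-gated cross-attention layers. Below is a minimal sketch of that gating pattern, not the authors' implementation; the dimensions, names, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Tanh-gated cross-attention in the style the Figure 4.5/4.6 captions
    describe: frozen language-model states attend to latent visual tokens,
    and zero-initialized gates leave the pretrained LM unchanged at start."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # tanh(0) = 0, so each residual branch starts as a no-op.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden:   (batch, n_text, d_model)    frozen LM hidden states
        # visual_tokens: (batch, n_latents, d_model) resampler output
        attn_out, _ = self.xattn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ffw_gate) * self.ffw(x)
```

Because both gates start at zero, the combined model initially reproduces the frozen language model exactly and learns how much visual information to admit.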
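The LLaVA caption (Figure 4.7) likewise spells out a full data path: frozen CLIP features Zv, a small projection MLP W, and visual embeddings Hv concatenated with the embedded instruction Hq. A hedged sketch of that path follows; the two-layer MLP and all sizes are assumptions for the example, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps frozen vision-encoder features Z_v into the language model's
    hidden space, yielding visual embeddings H_v (the caption's W)."""

    def __init__(self, d_vision: int, d_lm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, z_v: torch.Tensor) -> torch.Tensor:
        return self.proj(z_v)  # (batch, n_patches, d_lm)

# Usage sketch with made-up sizes: prepend projected visual tokens to the
# instruction embeddings and feed the result to the language model.
batch, n_patches, d_vision, d_lm = 2, 256, 1024, 4096
z_v = torch.randn(batch, n_patches, d_vision)   # frozen CLIP ViT features Z_v
h_q = torch.randn(batch, 32, d_lm)              # embedded instruction X_q -> H_q
h_v = VisualProjector(d_vision, d_lm)(z_v)      # visual embeddings H_v
lm_input = torch.cat([h_v, h_q], dim=1)         # sequence passed to the LM
```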
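The POPE caption (Figure 5.19) also describes an executable procedure: poll the model with yes/no existence questions about objects that are and are not in the image. A toy version follows, using only the uniform-random negative strategy (the caption also names frequent and adversarial sampling); the helper name and question template are invented for illustration.

```python
import random

def pope_probes(present, candidate_pool, n_negatives=3, seed=0):
    """Build POPE-style yes/no probes for one image: each ground-truth
    object yields a positive question, and objects absent from the image
    are sampled as negatives."""
    rng = random.Random(seed)
    absent = [obj for obj in candidate_pool if obj not in present]
    negatives = rng.sample(absent, min(n_negatives, len(absent)))
    probes = [(f"Is there a {obj} in the image?", "yes") for obj in present]
    probes += [(f"Is there a {obj} in the image?", "no") for obj in negatives]
    return probes

# Example with the caption's object categories.
print(pope_probes(["person", "chair", "umbrella"],
                  ["person", "chair", "umbrella", "dog", "table", "surfboard"]))
```

Hallucination is then scored by how often the model answers "yes" to the negative probes.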
Original abstract

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript offers an expository overview of Vision-Language Models, with the central claim that a non-exhaustive, intuition-focused mental map can supply enough structure for readers to engage new VLM papers confidently and to design systems without assembling components blindly. It explicitly disclaims exhaustive coverage of datasets, benchmarks, or model variants, framing its contribution as durable pedagogical structure rather than completeness or novel technical results.

Significance. If the delivered mental map proves clear and durable, the work could hold meaningful pedagogical value in the fast-moving cs.AI field by reducing the gap between superficial buzzword knowledge and practical understanding, thereby supporting more effective paper reading and system design in multimodal models.

minor comments (2)
  1. Abstract: the abstract contains unrendered LaTeX markup (e.g., \emph{it is too easy to get lost} and ``...''-style quotes); ensure consistent formatting and rendering in the final version.
  2. Throughout: repeated self-reference to "this book" creates ambiguity about the intended format and venue; clarify whether the manuscript is a journal article, survey, or book chapter.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript's pedagogical intent and for recommending minor revision. The work deliberately prioritizes a durable, intuition-focused mental map over exhaustive coverage of datasets or models, as stated in the abstract, to help readers navigate the fast-moving VLM literature with greater confidence.

Circularity Check

0 steps flagged

No significant circularity; purely expository overview

full rationale

The manuscript contains no derivations, equations, fitted parameters, predictions, or self-citations that could form a load-bearing chain. Its stated purpose is to supply an intuition-focused mental map for readers, explicitly disclaiming exhaustive coverage of datasets or models. No step reduces by construction to its own inputs, and the central claim remains independent of any internal fitting or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an explanatory book on existing technology, the work introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5508 in / 895 out tokens · 28910 ms · 2026-05-11T02:40:51.733916+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 3 internal anchors

  [1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).
  [2] Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).
  [3] Ren, Shaoqing; He, Kaiming; Girshick, Ross; Sun, Jian. Faster…
  [4] He, Kaiming; Gkioxari, Georgia; Dollár, et al. Mask… IEEE International Conference on Computer Vision (ICCV).
  [5] Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
  [6] Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS).
  [7] Training Language Models to Follow Instructions with Human Feedback. arXiv preprint arXiv:2203.02155.
  [8] LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  [9] Scaling up Visual and Vision-Language Representation Learning With Noisy Text Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).
  [10] Lu, Jiasen; Batra, Dhruv; Parikh, Devi; Lee, Stefan.
  [11] Tan, Hao; Bansal, Mohit.
  [12] Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems (NeurIPS).
  [13] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597.
  [14] Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollár, et al. Microsoft… European Conference on Computer Vision (ECCV).
  [15] Show and Tell: A Neural Image Caption Generator. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [16] Antol, Stanislaw; Agrawal, Aishwarya; Lu, Jiasen; Mitchell, Margaret; Batra, Dhruv; Zitnick, C. Lawrence; Parikh, Devi.
  [17] Goyal, Yash; Khot, Tejas; Summers-Stay, Douglas; Batra, Dhruv; Parikh, Devi. Making the V in…
  [18] Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [19] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. IEEE International Conference on Computer Vision (ICCV).
  [20] ReferItGame: Referring to Objects in Photographs of Natural Scenes. Empirical Methods in Natural Language Processing (EMNLP).
  [21] LeCun, Yann; Bottou, Léon; et al. Gradient-Based Learning Applied to Document Recognition.
  [22] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. Advances in Neural Information Processing Systems (NeurIPS).
  [23] Simonyan, Karen; Zisserman, Andrew. International Conference on Learning Representations (ICLR).
  [24] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [25] Lin, Tsung-Yi; Dollár, et al. Feature Pyramid Networks for Object Detection.
  [26] Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; Ma, Sean; Huang, Zhiheng; Karpathy, Andrej; Khosla, Aditya; Bernstein, Michael; Berg, Alexander C.; Fei-Fei, Li. International Journal of Computer Vision (IJCV).
  [27] Chen, Ting; Kornblith, Simon; Norouzi, Mohammad; Hinton, Geoffrey. Proceedings of the 37th International Conference on Machine Learning (ICML).
  [28] He, Kaiming; Fan, Haoqi; Wu, Yuxin; Xie, Saining; Girshick, Ross. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [29] Grill, Jean-Bastien; Strub, Florian; Altché, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. 2020.
  [30] Caron, Mathilde; Touvron, Hugo; Misra, Ishan; et al. Emerging Properties in Self-Supervised Vision Transformers. 2021.
  [31] He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, et al. Masked Autoencoders Are Scalable Vision Learners. 2022.
  [32] Elman, Jeffrey L. Cognitive Science.
  [33] Hochreiter, Sepp; Schmidhuber, Jürgen. Long Short-Term Memory.
  [34] Cho, Kyunghyun; van Merriënboer, et al. Learning Phrase Representations using… Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  [35] On the Opportunities and Risks of Foundation Models.
  [36] Language Models are Unsupervised Multitask Learners.
  [37] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. 2019.
  [38] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning (ICML).
  [39] Li, Junnan; Selvaraju, Ramprasaath R.; Gotmare, Akhilesh; Joty, Shafiq; Xiong, Caiming; Hoi, Steven C. H.
  [40] Visual Instruction Tuning. Advances in Neural Information Processing Systems (NeurIPS).
  [41] Qwen Technical Report. 2023.
  [42] Language Is Not All You Need: Aligning Perception with Language Models. Advances in Neural Information Processing Systems (NeurIPS).
  [43] PaLI-X: On Scaling Up a Multilingual Vision and Language Model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [44] Zhai, Xiaohua; Wang, Xingyi; Mustafa, Basil; Kolesnikov, Alexander; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Houlsby, Neil; et al.
  [45] Chen, Yen-Chun; Li, Linjie; Yu, Licheng; Kholy, Ahmed El; Ahmed, Faisal; Gan, Zhe; Cheng, Yu; Liu, Jingjing.
  [46] Zhang, Pengchuan; Li, Xiujun; Hu, Xiaowei; Yang, Jianwei; Zhang, Lei; Wang, Lijuan; Choi, Yejin; Gao, Jianfeng.
  [47] Houlsby, Neil; Giurgiu, Andrei; Jastrzebski, Stanislaw; Morrone, Bryan; de Laroussilhe, Quentin; Gesmundo, Andrea; Attariyan, Mona; Gelly, Sylvain. Parameter-Efficient Transfer Learning for…
  [48] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu.
  [49] Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).
  [50] Changpinyo, Soravit; Sharma, Piyush; Ding, Nan; Soricut, Radu. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [51] Krishna, Ranjay; Zhu, Yuke; Groth, Oliver; Johnson, Justin; Hata, Kenji; Kravitz, Joshua; Chen, Stephanie; Kalantidis, Yannis; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. 2017.
  [52] Kuznetsova, Alina; Rom, Hassan; Alldrin, Neil; Uijlings, Jasper; Krasin, Ivan; Pont-Tuset, Jordi; et al. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale.
  [53] Miech, Antoine; Zhukov, Dimitri; et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips.
  [54] Bain, Max; Nagrani, Arsha; Varol, Gül; Zisserman, Andrew. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  [55] Xu, Jun; Mei, Tao; Yao, Ting; Rui, Yong. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  [56] Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc. Proceedings of the 30th International World Wide Web Conference (WWW).
  [57] Harley, Adam W.; Ufkes, Alex; Derpanis, Konstantinos G. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR).
  [58] Zhong, Xu; Tang, Jianbin; Jimeno, Antonio; et al. PubLayNet: Largest Dataset Ever for Document Layout Analysis.
  [59] Sidorov, Oleksii; Hu, Ronghang; Rohrbach, Marcus; Singh, Amanpreet. Computer Vision –…
  [60] Schuhmann, Christoph; Beaumont, Romain; Vencu, Richard; Gordon, Cade; Wightman, Ross; Cherti, Mehdi; Coombes, Theo; Katta, Aarush; Mullis, Clayton; Wortsman, Mitchell; Schramowski, Patrick; Kundurthy, Srivatsa; Crowson, Katherine; Schmidt, Ludwig; Kaczmarczyk, Robert; Jitsev, Jenia. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track.
  [61] PaLI: A Jointly-Scaled Multilingual Language-Image Model. 2022.
  [62] PaLI-X: On Scaling Up a Multilingual Vision and Language Model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [63] LLaVA-NeXT: Improved Reasoning,… 2024.
  [64] LLaVA-NeXT-Interleave: Tackling Multi-Image, Video, and 3D in Large Multimodal Models. 2024.
  [65] Bai, Shuai; Chen, Keqin; Liu, Xuejing; Wang, Jialin; Ge, Wenbin; Song, Sibo; et al. Qwen2.5-…
  [66] Bai, Shuai; Cai, Yuxuan; Chen, Ruizhe; Chen, Keqin; Cheng, Zesen; et al. Qwen3-…
  [67] Ovis2.5 Technical Report. 2025.
  [68] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. 2025.
  [69] Multi30K: Multilingual English–German Image Descriptions. Proceedings of the 5th Workshop on Vision and Language.
  [70] Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollár, et al. Microsoft… Proceedings of the European Conference on Computer Vision (ECCV).
  [71] From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics.
  [72] nocaps: Novel Object Captioning at Scale. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  [73] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  [74] Marino, Kenneth; Rastegari, Mohammad; Farhadi, Ali; Mottaghi, Roozbeh.
  [75] Gurari, Danna; Li, Qing; Stangl, Amanda J.; Guo, Abigale; Lin, Chi; Grauman, Kristen; Wilson, Caroline; Bigham, Jeffrey P.
  [76] Modeling Context in Referring Expressions. Proceedings of the European Conference on Computer Vision (ECCV).
  [77] Kazemzadeh, Sahar; Ordonez, Vicente; Matten, Mark; Berg, Tamara.
  [78] Plummer, Bryan A.; Wang, Liwei; Cervantes, Chris M.; Caicedo, Juan C.; Hockenmaier, Julia; Lazebnik, Svetlana.
  [79] Singh, Ankur; Natarajan, Vivek; Shah, Meet; Jiang, Yu; Chen, Xinlei; Batra, Dhruv; Parikh, Devi; Rohrbach, Marcus.
  [80] Mathew, Minesh; Karatzas, Dimosthenis; Jawahar, C. V.
Showing first 80 references.