pith. machine review for the scientific record.

arxiv: 2306.14824 · v3 · submitted 2023-06-26 · 💻 cs.CL · cs.CV

Kosmos-2: Grounding Multimodal Large Language Models to the World

Pith reviewed 2026-05-12 05:13 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal large language models · visual grounding · referring expressions · phrase grounding · GrIT dataset · location tokens · embodied AI · multimodal perception

The pith

Kosmos-2 adds visual grounding to multimodal LLMs by encoding referring expressions as Markdown links to bounding-box location tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to give a multimodal large language model the ability to point text descriptions at specific image regions while keeping its existing language and instruction-following skills intact. It does so by rewriting referring expressions in the form of Markdown links whose targets are sequences of special location tokens, then training on a newly assembled collection of grounded image-text pairs called GrIT. The resulting model is tested on grounding benchmarks, referring-expression generation, and standard multimodal tasks, demonstrating that the added capability transfers to downstream uses. If the approach holds, multimodal models can begin to treat the visual world as an addressable space rather than a passive input.

Core claim

Kosmos-2 acquires grounding by representing each referring expression as a Markdown link of the form [text span](bounding boxes), where the bounding boxes are encoded as ordered sequences of location tokens; training on the constructed GrIT corpus of grounded image-text pairs then integrates this behavior into the model without loss of its prior multimodal and in-context learning abilities, as measured across referring comprehension, phrase grounding, referring generation, and general perception-language tasks.

What carries the argument

The Markdown-link representation of referring expressions, written as [text span](bounding boxes) with bounding boxes expressed as sequences of location tokens.
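The representation can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the grid size of 32 bins per side, the `<locN>` token spelling, the choice of top-left and bottom-right cells, and the helper names `bin_index`, `box_to_tokens`, and `ground` are all assumptions introduced here, since the text above does not specify the token vocabulary or the discretization.

```python
from typing import List, Tuple

# Assumed settings: neither the grid size nor the token spelling is given above.
GRID = 32  # hypothetical number of bins per image side

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def bin_index(x: float, y: float, grid: int = GRID) -> int:
    """Map a normalized point to the row-major index of the grid cell containing it."""
    col = min(int(x * grid), grid - 1)  # clamp so x == 1.0 falls in the last column
    row = min(int(y * grid), grid - 1)  # clamp so y == 1.0 falls in the last row
    return row * grid + col


def box_to_tokens(box: Box, grid: int = GRID) -> str:
    """Encode a box as two location tokens: top-left cell, then bottom-right cell."""
    x0, y0, x1, y1 = box
    return f"<loc{bin_index(x0, y0, grid)}><loc{bin_index(x1, y1, grid)}>"


def ground(span: str, boxes: List[Box], grid: int = GRID) -> str:
    """Render a referring expression as a Markdown-style link to its boxes."""
    # Multiple boxes are simply concatenated here; a real format may use a separator.
    target = "".join(box_to_tokens(b, grid) for b in boxes)
    return f"[{span}]({target})"


snowman = ground("a snowman", [(0.10, 0.20, 0.50, 0.90)])
print(snowman + " warming up by the fire")
```

Under these assumptions the grounded span is ordinary text, so it can sit inline in a caption and be predicted token by token like any other part of the sequence.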

If this is right

  • The model handles referring expression comprehension and phrase grounding tasks at competitive levels.
  • It can generate referring expressions that correctly identify image regions.
  • Performance on perception-language and general language tasks remains intact.
  • Grounding becomes available for integration into a range of downstream multimodal applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same link-based grounding format could be applied to video or 3-D scenes to let models refer to objects across time or depth.
  • Embodied agents might use the resulting outputs to plan physical actions that reference specific world locations.
  • Unified training on grounded text and world-modeling data could reduce the separation between language and spatial reasoning modules.

Load-bearing premise

Encoding referring expressions as Markdown links to location-token sequences and training on GrIT will add usable grounding without degrading the model's other language and multimodal capabilities.

What would settle it

If Kosmos-2 shows no improvement over non-grounded multimodal baselines on referring-expression comprehension or phrase-grounding benchmarks, or if scores on standard visual-question-answering tasks drop after the same training, the grounding method would be shown ineffective.
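A comparison of this kind needs a scoring rule for predicted boxes. The sketch below uses intersection-over-union with a 0.5 threshold, which is a common convention for referring-expression comprehension; the page does not state which threshold the paper uses, so the threshold and the helper names `iou` and `grounding_accuracy` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], golds: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the gold box meets the threshold."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(preds)


# One exact match and one weak overlap (IoU = 1/7) give an accuracy of 0.5.
print(grounding_accuracy([(0, 0, 2, 2), (0, 0, 2, 2)], [(0, 0, 2, 2), (1, 1, 3, 3)]))
```

Running the same function over a grounded model and a non-grounded baseline, alongside unchanged visual-question-answering scores, would give the two numbers the test above depends on.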

Original abstract

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., "[text span](bounding boxes)", where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Kosmos-2, a multimodal large language model that represents referring expressions as Markdown links of the form [text span](bounding boxes) where bounding boxes are encoded as sequences of location tokens. The model is trained on a newly constructed GrIT dataset of grounded image-text pairs in addition to existing multimodal corpora. Evaluations cover referring expression comprehension, phrase grounding, referring expression generation, perception-language tasks, and language understanding/generation. The authors claim that Kosmos-2 integrates grounding into downstream applications and lays the foundation for Embodiment AI via convergence of language, perception, action, and world modeling.

Significance. If the location-token representation and GrIT training demonstrably add effective grounding while preserving other MLLM capabilities, the work would provide a practical technical contribution to multimodal alignment. The GrIT dataset construction and Markdown-link format could serve as reusable baselines. The broader Embodiment-AI claim, however, rests on an untested extrapolation from perception-language results to action and world modeling.

major comments (2)
  1. [Abstract] Abstract: The statement that the work 'lays out the foundation for the development of Embodiment AI' and constitutes 'a key step toward artificial general intelligence' via 'the big convergence of language, multimodal perception, action, and world modeling' is not supported by the listed evaluations ((i) referring expression comprehension/phrase grounding, (ii) referring expression generation, (iii) perception-language tasks, (iv) language understanding/generation). No action, planning, navigation, or interaction tasks are described, so the convergence claim remains aspirational.
  2. [Evaluation] Evaluation sections: The manuscript must supply quantitative results, baselines, and ablations for the grounding tasks (e.g., comparison of location-token accuracy against prior phrase-grounding methods). Without these, it is impossible to verify whether the Markdown-link representation adds grounding capability without degrading other MLLM performance.
minor comments (2)
  1. [Abstract] The abstract states that code and pretrained models are available at https://aka.ms/kosmos-2 but provides no license, version, or reproducibility details.
  2. [Method] Notation for location tokens should be defined once with an explicit example (e.g., the exact token vocabulary and how bounding-box coordinates are discretized) rather than introduced only in passing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback. We respond point-by-point to the major comments below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The statement that the work 'lays out the foundation for the development of Embodiment AI' and constitutes 'a key step toward artificial general intelligence' via 'the big convergence of language, multimodal perception, action, and world modeling' is not supported by the listed evaluations ((i) referring expression comprehension/phrase grounding, (ii) referring expression generation, (iii) perception-language tasks, (iv) language understanding/generation). No action, planning, navigation, or interaction tasks are described, so the convergence claim remains aspirational.

    Authors: We agree that the current evaluations are confined to perception-language tasks and do not include action, planning, or navigation experiments. The abstract statement is intended to highlight how the introduced grounding capability provides a necessary foundation for future integration with action and world modeling. To prevent any overstatement, we will revise the abstract to more precisely delineate the present contributions while framing the convergence as a direction for subsequent research. revision: yes

  2. Referee: [Evaluation] Evaluation sections: The manuscript must supply quantitative results, baselines, and ablations for the grounding tasks (e.g., comparison of location-token accuracy against prior phrase-grounding methods). Without these, it is impossible to verify whether the Markdown-link representation adds grounding capability without degrading other MLLM performance.

    Authors: The manuscript already reports quantitative results on referring expression comprehension and phrase grounding, together with comparisons to prior methods on standard benchmarks. We recognize that additional ablations isolating the location-token representation and Markdown-link format would further clarify its incremental benefit and confirm preservation of other capabilities. We will therefore add these ablations and expanded baseline analyses in the revised version. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical

Full rationale

The paper describes an empirical construction: referring expressions are represented as Markdown links to location-token sequences, GrIT data is built from multimodal corpora, and Kosmos-2 is trained on this data plus existing MLLM corpora. All listed evaluations (referring expression comprehension, phrase grounding, referring expression generation, perception-language tasks, language understanding) are direct experimental outcomes of this training. The Embodiment-AI foundation claim is a forward-looking statement with no equations, uniqueness theorems, or self-citations that reduce the result to its inputs by construction. No self-definitional, fitted-prediction, or ansatz-smuggling patterns appear; the work is self-contained against its own benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim depends on the effectiveness of the new Markdown grounding format and the quality of the constructed GrIT data; both are introduced by the paper without external benchmarks in the abstract.

axioms (1)
  • [domain assumption] Multimodal corpora contain extractable grounded image-text pairs suitable for training
    Invoked when constructing the GrIT dataset from existing corpora
invented entities (2)
  • GrIT dataset (no independent evidence)
    purpose: Large-scale training data of grounded image-text pairs
    Newly constructed for this work
  • Location tokens (no independent evidence)
    purpose: Represent bounding boxes as sequences the model can generate
    New token sequences added to the model's vocabulary
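
On the consuming side, boxes generated as token sequences have to be parsed back into coordinates. The following is a rough sketch of that step under stated assumptions: a 32-by-32 grid, tokens spelled `<locN>`, boxes written as a top-left cell followed by a bottom-right cell, and decoding to cell centers. None of these details appear on this page, and the names `LINK`, `LOC`, `cell_center`, and `parse_grounded` are hypothetical.

```python
import re
from typing import List, Tuple

GRID = 32  # assumed number of bins per image side (not stated on this page)

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]

# Assumed surface form: [text span](<locA><locB>...), two tokens per box.
LINK = re.compile(r"\[([^\]]+)\]\(((?:<loc\d+>)+)\)")
LOC = re.compile(r"<loc(\d+)>")


def cell_center(index: int, grid: int = GRID) -> Tuple[float, float]:
    """Return the normalized (x, y) center of a row-major grid cell."""
    row, col = divmod(index, grid)
    return ((col + 0.5) / grid, (row + 0.5) / grid)


def parse_grounded(text: str, grid: int = GRID) -> List[Tuple[str, List[Box]]]:
    """Extract (text span, boxes) pairs from text containing grounded links."""
    results = []
    for match in LINK.finditer(text):
        ids = [int(n) for n in LOC.findall(match.group(2))]
        boxes = []
        # Pair tokens as (top-left, bottom-right); a trailing unpaired token is ignored.
        for top_left, bottom_right in zip(ids[0::2], ids[1::2]):
            x0, y0 = cell_center(top_left, grid)
            x1, y1 = cell_center(bottom_right, grid)
            boxes.append((x0, y0, x1, y1))
        results.append((match.group(1), boxes))
    return results


print(parse_grounded("[a snowman](<loc195><loc912>) warming up by the fire"))
```

Decoding to cell centers bounds the coordinate error by half a cell per side, which is one reason the grid resolution matters for the "no independent evidence" entry above.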

pith-pipeline@v0.9.0 · 5543 in / 1205 out tokens · 70049 ms · 2026-05-12T05:13:43.008469+00:00 · methodology

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  3. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  4. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  5. RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

    cs.CV 2026-05 unverdicted novelty 7.0

    RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

  6. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  7. Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

    cs.CV 2026-04 conditional novelty 7.0

    Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

  8. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  9. STORM: End-to-End Referring Multi-Object Tracking in Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.

  10. MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.

  11. OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.

  12. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  13. 3D-VLA: A 3D Vision-Language-Action Generative World Model

    cs.CV 2024-03 unverdicted novelty 7.0

    3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

  14. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  15. Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

    cs.CV 2026-05 unverdicted novelty 6.0

    Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models.

  16. LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

    cs.CV 2026-05 unverdicted novelty 6.0

    LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...

  17. AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

    cs.CV 2026-05 unverdicted novelty 6.0

    AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.

  18. CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments

    cs.HC 2026-04 conditional novelty 6.0

    CoNewsReader integrates user comments with an LLM to improve critical news reading on social media, with a 24-participant study showing gains in comprehension and critical thinking over baseline interfaces.

  19. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  20. Latent Denoising Improves Visual Alignment in Large Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

  21. G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.

  22. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  23. Perception Encoder: The best visual embeddings are not at the output of the network

    cs.CV 2025-04 unverdicted novelty 6.0

    Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...

  24. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  25. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  26. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  27. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  28. CoNewsReader: Supporting Comprehensive Understanding and Raising Critical Thoughts on Social Media News Through Comments

    cs.HC 2026-04 unverdicted novelty 5.0

    CoNewsReader leverages comments and LLMs to support critical news reading on social media, with a within-subjects study of 24 students showing more engaging experiences and better comprehension and critical thought pe...

  29. DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA

    cs.CL 2026-04 unverdicted novelty 5.0

    DIAGRAMS introduces a schema-driven annotation tool that proposes reasoning-level evidence regions for Diagram QA pairs and reports 85.39% precision and 75.30% recall against human final selections on six datasets.

  30. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

  31. Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 5.0

    PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.

  32. AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

    cs.CL 2026-04 unverdicted novelty 5.0

    AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.

  33. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  34. A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    A patch-augmented cross-view regularization method reduces backdoor attack success rates in multimodal LLMs by enforcing output differences between original and perturbed views while using entropy constraints to prese...

  35. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  36. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  37. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  38. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  39. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  40. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  41. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  42. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

  43. Scrapyard AI

    cs.CY 2026-04 unverdicted novelty 3.0

    Obsolete AI models left behind by rapid development can be repurposed like scrap materials to analyze and communicate the environmental and social effects of global mining.
