pith. machine review for the scientific record.

arxiv: 2205.14100 · v5 · submitted 2022-05-27 · 💻 cs.CV

Recognition: 2 theorem links

GIT: A Generative Image-to-text Transformer for Vision and Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords generative image-to-text · vision-language pre-training · image captioning · visual question answering · transformer · state-of-the-art · TextCaps

The pith

A simplified generative image-to-text transformer unifies vision-language tasks and sets new state-of-the-art results on 12 benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs GIT as one image encoder paired with one text decoder trained under a single language modeling objective. By scaling pre-training data volume and model size, the approach aims to match or exceed prior methods that rely on complex multi-module structures and external components such as object detectors or OCR. It reports new state-of-the-art scores across 12 benchmarks for image and video captioning as well as question answering, including the first result to surpass human performance on TextCaps. A sympathetic reader would care because the work tests whether architectural simplicity plus scale can replace specialized pipelines in multimodal understanding.

Core claim

GIT establishes that a single image encoder and text decoder under language modeling, scaled in data and size, unifies vision-language tasks and achieves new state-of-the-art results on 12 benchmarks, surpassing human performance on TextCaps with 138.2 versus 125.5 CIDEr.

What carries the argument

The generative image-to-text transformer consisting of one image encoder and one text decoder trained with a single language modeling task.
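As a minimal sketch of that design, the generation loop reduces to: encode the image once, then autoregressively pick the next token from a single decoder head. Every component below is a hypothetical stand-in (hash-based pseudo-features, a toy vocabulary), not the paper's actual encoder or decoder; only the control flow mirrors the described architecture.

```python
# Toy sketch of a GIT-style single-encoder, single-decoder generative model.
# The "encoder" and "decoder" here are deterministic stand-ins, not real networks.

VOCAB = ["[BOS]", "a", "sign", "that", "says", "stop", "[EOS]"]

def encode_image(image_id: int) -> int:
    # Stand-in for the image encoder: map an image to a feature seed.
    return image_id * 2654435761 % 2**32

def decoder_logits(image_feat: int, prefix: list[str]) -> list[float]:
    # Stand-in for the text decoder: pseudo-logits over VOCAB conditioned
    # on the image feature and the token prefix (causal, left-to-right).
    seed = image_feat
    for tok in prefix:
        seed = (seed * 31 + hash(tok)) % 2**32
    return [((seed >> i) % 97) / 97.0 for i in range(len(VOCAB))]

def greedy_caption(image_id: int, max_len: int = 8) -> list[str]:
    # One encoder pass, then a greedy autoregressive decode under one LM head.
    feat = encode_image(image_id)
    tokens = ["[BOS]"]
    for _ in range(max_len):
        logits = decoder_logits(feat, tokens)
        nxt = VOCAB[logits.index(max(logits))]
        tokens.append(nxt)
        if nxt == "[EOS]":
            break
    return tokens[1:]
```

The point of the sketch is the single conditioning path: image features enter the decoder once, and captioning, VQA, and scene-text reading all reduce to the same next-token loop with different text prefixes.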

Load-bearing premise

That increasing pre-training data volume and model size with a single language-modeling objective on a simple encoder-decoder architecture is enough to surpass prior specialized methods on vision-language tasks.

What would settle it

Showing that a model which retains complex multi-modal encoders plus external detectors and OCR, while matching GIT's pre-training data volume and parameter count, still falls short of GIT's scores on the 12 reported benchmarks.

read the original abstract

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at \url{https://github.com/microsoft/GenerativeImage2Text}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GIT, a generative image-to-text transformer consisting of a single image encoder and text decoder trained end-to-end under a unified language modeling objective. By scaling pre-training data volume and model size, the authors claim new state-of-the-art results on 12 vision-language benchmarks (including image/video captioning and VQA), with the model surpassing human performance on TextCaps (138.2 vs. 125.5 CIDEr) for the first time; they also present generation-based schemes for image classification and scene text recognition.

Significance. If the empirical claims hold after verification, the work would be significant for demonstrating that a minimal encoder-decoder architecture under a single LM objective can unify tasks and exceed prior complex models that rely on external modules (detectors, OCR). This would underscore the value of scale over architectural elaboration and simplify the design space for vision-language models. Code release supports reproducibility.

major comments (2)
  1. [Experiments] Experimental sections: the manuscript reports strong benchmark numbers and SOTA claims but provides no ablations or matched-scale re-runs that hold pre-training data volume and parameter count fixed while restoring external modules or multi-encoder designs from prior work. This is load-bearing for the central unification claim, as it leaves open whether gains derive primarily from scaling rather than simplification.
  2. [TextCaps Evaluation] TextCaps results (Table X, CIDEr row): the 138.2 score surpassing human performance is presented without error analysis or ablation on how the image encoder captures scene text in the absence of explicit OCR; this weakens confidence that the architecture generalizes beyond the specific training distribution.
minor comments (2)
  1. [Method] Notation for the image encoder and text decoder could be clarified with explicit equations showing the joint LM loss formulation.
  2. [Figures] Figure captions for benchmark comparisons should include the exact prior methods and their scales for direct visual comparison.
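On minor comment 1, one plausible formalization of the single training objective (an assumption based on the abstract's description, not the paper's exact notation) is a standard autoregressive language-modeling loss conditioned on encoded image features:

```latex
% Hypothetical notation: I is the image, y_1,\dots,y_N the caption tokens,
% y_0 = [BOS], y_{N+1} = [EOS]; the decoder applies causal masking.
\mathcal{L}_{\mathrm{LM}}
  = -\frac{1}{N+1} \sum_{i=1}^{N+1}
    \log p\!\left(y_i \,\middle|\, \mathrm{Enc}(I),\, y_0, \ldots, y_{i-1}\right)
```

Under this reading, captioning and question answering differ only in which tokens are conditioned on versus predicted, which is the formal content of the unification claim.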

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of our simplified architecture. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experimental sections: the manuscript reports strong benchmark numbers and SOTA claims but provides no ablations or matched-scale re-runs that hold pre-training data volume and parameter count fixed while restoring external modules or multi-encoder designs from prior work. This is load-bearing for the central unification claim, as it leaves open whether gains derive primarily from scaling rather than simplification.

    Authors: We agree that matched-scale ablations re-implementing prior complex models (with external modules or multi-encoder designs) at identical data volume and parameter count would provide stronger isolation of the simplification benefit. Such experiments are computationally prohibitive at our scale. Our evidence instead rests on consistent outperformance of reported prior SOTA results that used more elaborate designs, combined with our internal scaling curves showing gains from model size and data. We will revise the experimental discussion to explicitly note this limitation and emphasize that the unification claim is supported by the single-architecture results rather than direct head-to-head re-runs. revision: partial

  2. Referee: [TextCaps Evaluation] TextCaps results (Table X, CIDEr row): the 138.2 score surpassing human performance is presented without error analysis or ablation on how the image encoder captures scene text in the absence of explicit OCR; this weakens confidence that the architecture generalizes beyond the specific training distribution.

    Authors: We acknowledge the value of error analysis for the TextCaps result. The image encoder is a Vision Transformer pre-trained on large-scale image-text pairs that naturally contain scene text, allowing implicit learning of text recognition within the generative objective. We will add a dedicated error analysis subsection with qualitative examples of generated captions on text-rich images, attention visualizations highlighting text regions, and a breakdown of failure cases. A controlled ablation removing all text from pre-training data is not feasible within our compute budget but would be a useful direction for future work; the current results on TextCaps already demonstrate generalization to scene-text understanding. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical scaling results on public benchmarks

full rationale

The paper describes an empirical pipeline: a simplified single-encoder single-decoder transformer is pre-trained with a language-modeling objective on a large corpus and then fine-tuned/evaluated on standard vision-language benchmarks. No equations, uniqueness theorems, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. Performance numbers (e.g., TextCaps CIDEr) are reported outcomes of training and testing, not self-definitions or renamings of prior results. Self-citations, if present, are not load-bearing for the central claim; the unification argument rests on the observed benchmark margins rather than any circular reduction. This is a standard scaling experiment whose validity can be checked externally by re-training or re-evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of scaling a simplified transformer under language modeling; no new physical entities are postulated and the main assumptions are standard in the transformer literature.

free parameters (2)
  • model scale
    The paper states that model size was increased but provides no specific parameter count or fitting procedure in the abstract.
  • pre-training data volume
    Increased pre-training data is cited as a performance driver but exact dataset sizes or selection criteria are not detailed in the abstract.
axioms (1)
  • domain assumption: A single language modeling objective on paired image-text data is sufficient to learn unified vision-language representations.
    Invoked by the choice to train the entire model end-to-end under one generative task without auxiliary losses or modules.

pith-pipeline@v0.9.0 · 5512 in / 1280 out tokens · 27024 ms · 2026-05-16T20:51:12.001343+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

  2. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    cs.CV 2023-10 unverdicted novelty 7.0

    HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

  3. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  4. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  5. Language Is Not All You Need: Aligning Perception with Language Models

    cs.CL 2023-02 conditional novelty 7.0

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  6. PaLI: A Jointly-Scaled Multilingual Language-Image Model

    cs.CV 2022-09 conditional novelty 7.0

    PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

  7. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  8. Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.

  9. ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

  10. From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

    cs.LG 2026-03 unverdicted novelty 6.0

    EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...

  11. CogVLM: Visual Expert for Pretrained Language Models

    cs.CV 2023-11 conditional novelty 6.0

    CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

  12. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  13. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  14. Sigmoid Loss for Language Image Pre-Training

    cs.CV 2023-03 conditional novelty 6.0

    SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

  15. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  16. Text-Guided Multi-Scale Frequency Representation Adaptation

    cs.CV 2026-05 unverdicted novelty 5.0

    FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.

  17. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  18. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 18 Pith papers · 11 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198,

  2. [2]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

  5. [5]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311,

  6. [6]

Universal Captioner: Long-Tail Vision-and-Language Model Training Through Content-Style Separation

    Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, and Rita Cucchiara. Universal captioner: Long-tail vision-and-language model training through content-style separation.arXiv preprint arXiv:2111.12727,

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805,

  8. [8]

    An empirical study of training end-to-end vision-and-language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv: 2111.02387,

  9. [9]

    Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition

    Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, and Yongdong Zhang. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition. InCVPR, 2021a. Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Injecting semantic concepts into end-to-end image captionin...

  10. [10]

Structured Multimodal Attentions for TextVQA

    Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton van den Hengel, and Qi Wu. Structured multimodal attentions for textvqa.arXiv preprint arXiv:2006.00753,

  11. [11]

Captioning Images Taken by People Who Are Blind

    Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind.arXiv preprint arXiv:2002.08565,

  12. [12]

Scaling Up Vision-Language Pre-training for Image Captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning.arXiv preprint arXiv:2111.12233, 2021a. Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. VIVO: surpassing human performance in novel object captioning with visual voca...

  13. [13]

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

    Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers.arXiv preprint arXiv:2004.00849,

  14. [14]

    Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

    Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition.arXiv preprint arXiv:1406.2227,

  15. [15]

    Icdar 2013 robust reading competition

    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. Icdar 2013 robust reading competition. InICDAR,

  16. [16]

    Icdar 2015 competition on robust reading

    Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. InICDAR,

  17. [17]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations.arXiv preprint arXiv:1602.07332,

  18. [18]

    mplug: Effective and efficient vision-language learning by cross-modal skip-connections

    Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a. Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align b...

  19. [19]

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

    Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning.arXiv preprint arXiv:2111.13196, 2021a. Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollá...

  20. [20]

Vx2text: End-to-End Learning of Video-Based Text Generation from Multimodal Inputs

    Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, and Lorenzo Torresani. Vx2text: End-to-end learning of video-based text generation from multimodal inputs. InCVPR, 2021b. Fen Liu, Guanghui Xu, Qi Wu, Qing Du, Wei Jia, and Mingkui Tan. Cascade reasoning network for text-based visual question answering. In Chang Wen Chen, Rita Cucchiara, X...

  21. [21]

UniVL: A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation

    Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353,

  22. [22]

    Maskocr: Text recognition with masked encoder-decoder pretraining

    Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, and Jingdong Wang. Maskocr: Text recognition with masked encoder-decoder pretraining. arXiv preprint arXiv:2206.00311,

  23. [23]

Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model

    Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team mia at textvqa challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model.arXiv preprint arXiv:2106.15332,

  24. [24]

End-to-End Generative Pretraining for Multimodal Video Captioning

    Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning.arXiv preprint arXiv:2201.08264,

  25. [25]

How Much Can CLIP Benefit Vision-and-Language Tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can CLIP benefit vision-and-language tasks?arXiv preprint arXiv:2107.06383,

  26. [26]

    Clip4caption ++: Multi-clip for video caption

    Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, and Dian Li. Clip4caption ++: Multi-clip for video caption. arXiv preprint arXiv:2110.05204,

  27. [27]

    Translating Videos to Natural Language Using Deep Recurrent Neural Networks

    Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks.arXiv preprint arXiv:1412.4729,

  28. [28]

    All in one: Exploring unified video-language pre-training

    46 Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training.arXiv preprint arXiv:2203.07303, 2022a. Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. Controllable video captioning with pos sequence guida...

  29. [29]

UFO: A Unified Transformer for Vision-Language Representation Learning

    Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, and Lijuan Wang. UFO: A unified transformer for vision-language representation learning.arXiv preprint arXiv:2111.10023, 2021a. Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. InICCV,

  30. [30]

    arXiv preprint arXiv:2202.03052 , year=

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework.arXiv preprint arXiv:2202.03052, 2022b. Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A lar...

  31. [31]

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

    Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, and Michael Zeng. Visual clues: Bridging vision and language foundations for image paragraph captioning.arXiv preprint arXiv:2206.01843,

  32. [32]

Probing Inter-Modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

    Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. Prob- ing inter-modality: Visual parsing with self-attention for vision-language pre-training.arXiv preprint arXiv:2106.13488, 2021a. Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. Probing inter-modality: Visual parsing with s...

  33. [33]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917,

  34. [34]

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision.arXiv prepri...

  35. [35]

    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language.arXiv preprint arXiv:2204.00598,

  36. [36]

VATEX Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

    Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, and Jing Liu. Vatex video captioning challenge 2020: Multi-view features and hybrid reward strategies for video captioning.arXiv preprint arXiv:1910.11102,