pith. machine review for the scientific record.

arxiv: 2205.14100 · v5 · submitted 2022-05-27 · 💻 cs.CV

Recognition: 2 theorem links

GIT: A Generative Image-to-text Transformer for Vision and Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords generative image-to-text · vision-language pre-training · image captioning · visual question answering · transformer · state-of-the-art · TextCaps

The pith

A simplified generative image-to-text transformer unifies vision-language tasks and sets new state-of-the-art results on 12 benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs GIT as one image encoder paired with one text decoder trained under a single language modeling objective. By scaling pre-training data volume and model size, the approach aims to match or exceed prior methods that rely on complex multi-module structures and external components such as object detectors or OCR. It reports new state-of-the-art scores across 12 benchmarks for image and video captioning as well as question answering, including the first result to surpass human performance on TextCaps. A sympathetic reader would care because the work tests whether architectural simplicity plus scale can replace specialized pipelines in multimodal understanding.

Core claim

GIT establishes that a single image encoder and text decoder under language modeling, scaled in data and size, unifies vision-language tasks and achieves new state-of-the-art results on 12 benchmarks, surpassing human performance on TextCaps with 138.2 versus 125.5 CIDEr.

What carries the argument

The generative image-to-text transformer consisting of one image encoder and one text decoder trained with a single language modeling task.
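As a minimal sketch of that design, the generation loop reduces to: encode the image once, then autoregressively pick the next token from a single decoder head. Every component below is a hypothetical stand-in (hash-based pseudo-features, a toy vocabulary), not the paper's actual encoder or decoder; only the control flow mirrors the described architecture.

```python
# Toy sketch of a GIT-style single-encoder, single-decoder generative model.
# The "encoder" and "decoder" here are deterministic stand-ins, not real networks.

VOCAB = ["[BOS]", "a", "sign", "that", "says", "stop", "[EOS]"]

def encode_image(image_id: int) -> int:
    # Stand-in for the image encoder: map an image to a feature seed.
    return image_id * 2654435761 % 2**32

def decoder_logits(image_feat: int, prefix: list[str]) -> list[float]:
    # Stand-in for the text decoder: pseudo-logits over VOCAB conditioned
    # on the image feature and the token prefix (causal, left-to-right).
    seed = image_feat
    for tok in prefix:
        seed = (seed * 31 + hash(tok)) % 2**32
    return [((seed >> i) % 97) / 97.0 for i in range(len(VOCAB))]

def greedy_caption(image_id: int, max_len: int = 8) -> list[str]:
    # One encoder pass, then a greedy autoregressive decode under one LM head.
    feat = encode_image(image_id)
    tokens = ["[BOS]"]
    for _ in range(max_len):
        logits = decoder_logits(feat, tokens)
        nxt = VOCAB[logits.index(max(logits))]
        tokens.append(nxt)
        if nxt == "[EOS]":
            break
    return tokens[1:]
```

The point of the sketch is the single conditioning path: image features enter the decoder once, and captioning, VQA, and scene-text reading all reduce to the same next-token loop with different text prefixes.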

Load-bearing premise

That increasing pre-training data volume and model size with a single language-modeling objective on a simple encoder-decoder architecture is enough to surpass prior specialized methods on vision-language tasks.

What would settle it

Showing that a model which retains complex multi-modal encoders plus external detectors and OCR, while matching GIT's pre-training data volume and parameter count, still falls short of GIT's scores on the 12 reported benchmarks.

read the original abstract

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at \url{https://github.com/microsoft/GenerativeImage2Text}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GIT, a generative image-to-text transformer consisting of a single image encoder and text decoder trained end-to-end under a unified language modeling objective. By scaling pre-training data volume and model size, the authors claim new state-of-the-art results on 12 vision-language benchmarks (including image/video captioning and VQA), with the model surpassing human performance on TextCaps (138.2 vs. 125.5 CIDEr) for the first time; they also present generation-based schemes for image classification and scene text recognition.

Significance. If the empirical claims hold after verification, the work would be significant for demonstrating that a minimal encoder-decoder architecture under a single LM objective can unify tasks and exceed prior complex models that rely on external modules (detectors, OCR). This would underscore the value of scale over architectural elaboration and simplify the design space for vision-language models. Code release supports reproducibility.

major comments (2)
  1. [Experiments] Experimental sections: the manuscript reports strong benchmark numbers and SOTA claims but provides no ablations or matched-scale re-runs that hold pre-training data volume and parameter count fixed while restoring external modules or multi-encoder designs from prior work. This is load-bearing for the central unification claim, as it leaves open whether gains derive primarily from scaling rather than simplification.
  2. [TextCaps Evaluation] TextCaps results (Table X, CIDEr row): the 138.2 score surpassing human performance is presented without error analysis or ablation on how the image encoder captures scene text in the absence of explicit OCR; this weakens confidence that the architecture generalizes beyond the specific training distribution.
minor comments (2)
  1. [Method] Notation for the image encoder and text decoder could be clarified with explicit equations showing the joint LM loss formulation.
  2. [Figures] Figure captions for benchmark comparisons should include the exact prior methods and their scales for direct visual comparison.
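On minor comment 1, one plausible formalization of the single training objective (an assumption based on the abstract's description, not the paper's exact notation) is a standard autoregressive language-modeling loss conditioned on encoded image features:

```latex
% Hypothetical notation: I is the image, y_1,\dots,y_N the caption tokens,
% y_0 = [BOS], y_{N+1} = [EOS]; the decoder applies causal masking.
\mathcal{L}_{\mathrm{LM}}
  = -\frac{1}{N+1} \sum_{i=1}^{N+1}
    \log p\!\left(y_i \,\middle|\, \mathrm{Enc}(I),\, y_0, \ldots, y_{i-1}\right)
```

Under this reading, captioning and question answering differ only in which tokens are conditioned on versus predicted, which is the formal content of the unification claim.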

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of our simplified architecture. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experimental sections: the manuscript reports strong benchmark numbers and SOTA claims but provides no ablations or matched-scale re-runs that hold pre-training data volume and parameter count fixed while restoring external modules or multi-encoder designs from prior work. This is load-bearing for the central unification claim, as it leaves open whether gains derive primarily from scaling rather than simplification.

    Authors: We agree that matched-scale ablations re-implementing prior complex models (with external modules or multi-encoder designs) at identical data volume and parameter count would provide stronger isolation of the simplification benefit. Such experiments are computationally prohibitive at our scale. Our evidence instead rests on consistent outperformance of reported prior SOTA results that used more elaborate designs, combined with our internal scaling curves showing gains from model size and data. We will revise the experimental discussion to explicitly note this limitation and emphasize that the unification claim is supported by the single-architecture results rather than direct head-to-head re-runs. revision: partial

  2. Referee: [TextCaps Evaluation] TextCaps results (Table X, CIDEr row): the 138.2 score surpassing human performance is presented without error analysis or ablation on how the image encoder captures scene text in the absence of explicit OCR; this weakens confidence that the architecture generalizes beyond the specific training distribution.

    Authors: We acknowledge the value of error analysis for the TextCaps result. The image encoder is a Vision Transformer pre-trained on large-scale image-text pairs that naturally contain scene text, allowing implicit learning of text recognition within the generative objective. We will add a dedicated error analysis subsection with qualitative examples of generated captions on text-rich images, attention visualizations highlighting text regions, and a breakdown of failure cases. A controlled ablation removing all text from pre-training data is not feasible within our compute budget but would be a useful direction for future work; the current results on TextCaps already demonstrate generalization to scene-text understanding. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical scaling results on public benchmarks

full rationale

The paper describes an empirical pipeline: a simplified single-encoder single-decoder transformer is pre-trained with a language-modeling objective on a large corpus and then fine-tuned/evaluated on standard vision-language benchmarks. No equations, uniqueness theorems, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. Performance numbers (e.g., TextCaps CIDEr) are reported outcomes of training and testing, not self-definitions or renamings of prior results. Self-citations, if present, are not load-bearing for the central claim; the unification argument rests on the observed benchmark margins rather than any circular reduction. This is a standard scaling experiment whose validity can be checked externally by re-training or re-evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of scaling a simplified transformer under language modeling; no new physical entities are postulated and the main assumptions are standard in the transformer literature.

free parameters (2)
  • model scale
    The paper states that model size was increased but provides no specific parameter count or fitting procedure in the abstract.
  • pre-training data volume
    Increased pre-training data is cited as a performance driver but exact dataset sizes or selection criteria are not detailed in the abstract.
axioms (1)
  • domain assumption: A single language modeling objective on paired image-text data is sufficient to learn unified vision-language representations.
    Invoked by the choice to train the entire model end-to-end under one generative task without auxiliary losses or modules.

pith-pipeline@v0.9.0 · 5512 in / 1280 out tokens · 27024 ms · 2026-05-16T20:51:12.001343+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

  2. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    cs.CV 2023-10 unverdicted novelty 7.0

    HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

  3. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  4. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  5. Language Is Not All You Need: Aligning Perception with Language Models

    cs.CL 2023-02 conditional novelty 7.0

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  6. PaLI: A Jointly-Scaled Multilingual Language-Image Model

    cs.CV 2022-09 conditional novelty 7.0

    PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

  7. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  8. Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.

  9. ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

  10. From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

    cs.LG 2026-03 unverdicted novelty 6.0

    EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...

  11. CogVLM: Visual Expert for Pretrained Language Models

    cs.CV 2023-11 conditional novelty 6.0

    CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

  12. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    cs.CV 2023-06 accept novelty 6.0

    A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

  13. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  14. Sigmoid Loss for Language Image Pre-Training

    cs.CV 2023-03 conditional novelty 6.0

    SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

  15. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  16. Text-Guided Multi-Scale Frequency Representation Adaptation

    cs.CV 2026-05 unverdicted novelty 5.0

    FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.

  17. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  18. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 18 Pith papers · 11 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198,

  2. [2]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

  5. [5]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311,

  6. [6]

Universal Captioner: Long-Tail Vision-and-Language Model Training Through Content-Style Separation

    Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, and Rita Cucchiara. Universal captioner: Long-tail vision-and-language model training through content-style separation.arXiv preprint arXiv:2111.12727,

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805,

  8. [8]

    An empirical study of training end-to-end vision-and-language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv: 2111.02387,

  9. [9]

    Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition

    Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, and Yongdong Zhang. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition. InCVPR, 2021a. Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Injecting semantic concepts into end-to-end image captionin...

  10. [10]

Structured Multimodal Attentions for TextVQA

    Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton van den Hengel, and Qi Wu. Structured multimodal attentions for textvqa.arXiv preprint arXiv:2006.00753,

  11. [11]

Captioning Images Taken by People Who Are Blind

    Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind.arXiv preprint arXiv:2002.08565,

  12. [12]

Scaling Up Vision-Language Pre-training for Image Captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning.arXiv preprint arXiv:2111.12233, 2021a. Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. VIVO: surpassing human performance in novel object captioning with visual voca...

  13. [13]

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

    Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers.arXiv preprint arXiv:2004.00849,

  14. [14]

    Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

    Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition.arXiv preprint arXiv:1406.2227,

  15. [15]

    Icdar 2013 robust reading competition

    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. Icdar 2013 robust reading competition. InICDAR,

  16. [16]

    Icdar 2015 competition on robust reading

    Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. InICDAR,

  17. [17]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations.arXiv preprint arXiv:1602.07332,

  18. [18]

    mplug: Effective and efficient vision-language learning by cross-modal skip-connections

    Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a. Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align b...

  19. [19]

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

    Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning.arXiv preprint arXiv:2111.13196, 2021a. Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollá...

  20. [20]

Vx2text: End-to-End Learning of Video-Based Text Generation from Multimodal Inputs

    Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, and Lorenzo Torresani. Vx2text: End-to-end learning of video-based text generation from multimodal inputs. InCVPR, 2021b. Fen Liu, Guanghui Xu, Qi Wu, Qing Du, Wei Jia, and Mingkui Tan. Cascade reasoning network for text-based visual question answering. In Chang Wen Chen, Rita Cucchiara, X...

  21. [21]

UniVL: A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation

    Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353,

  22. [22]

    Maskocr: Text recognition with masked encoder-decoder pretraining

    Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, and Jingdong Wang. Maskocr: Text recognition with masked encoder-decoder pretraining. arXiv preprint arXiv:2206.00311,

  23. [23]

Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model

    Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team mia at textvqa challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model.arXiv preprint arXiv:2106.15332,

  24. [24]

End-to-End Generative Pretraining for Multimodal Video Captioning

    Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning.arXiv preprint arXiv:2201.08264,

  25. [25]

How Much Can CLIP Benefit Vision-and-Language Tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can CLIP benefit vision-and-language tasks?arXiv preprint arXiv:2107.06383,

  26. [26]

    Clip4caption ++: Multi-clip for video caption

    Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, and Dian Li. Clip4caption ++: Multi-clip for video caption. arXiv preprint arXiv:2110.05204,

  27. [27]

    Translating Videos to Natural Language Using Deep Recurrent Neural Networks

    Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks.arXiv preprint arXiv:1412.4729,

  28. [28]

    All in one: Exploring unified video-language pre-training

    46 Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training.arXiv preprint arXiv:2203.07303, 2022a. Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. Controllable video captioning with pos sequence guida...

  29. [29]

UFO: A Unified Transformer for Vision-Language Representation Learning

    Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, and Lijuan Wang. UFO: A unified transformer for vision-language representation learning.arXiv preprint arXiv:2111.10023, 2021a. Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. InICCV,

  30. [30]

    arXiv preprint arXiv:2202.03052 , year=

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework.arXiv preprint arXiv:2202.03052, 2022b. Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A lar...

  31. [31]

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

    Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, and Michael Zeng. Visual clues: Bridging vision and language foundations for image paragraph captioning.arXiv preprint arXiv:2206.01843,

  32. [32]

Probing Inter-Modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

    Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. Prob- ing inter-modality: Visual parsing with self-attention for vision-language pre-training.arXiv preprint arXiv:2106.13488, 2021a. Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. Probing inter-modality: Visual parsing with s...

  33. [33]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917,

  34. [34]

    Florence: A New Foundation Model for Computer Vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision.arXiv prepri...

  35. [35]

    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language.arXiv preprint arXiv:2204.00598,

  36. [36]

VATEX Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

    Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, and Jing Liu. Vatex video captioning challenge 2020: Multi-view features and hybrid reward strategies for video captioning.arXiv preprint arXiv:1910.11102,